Detokenization is the inverse of tokenization: removal of spaces between tokens when required. For instance, there should not be a space between a word and a period (both are separate tokens) at the end of a sentence.
Tokenized:   Hello world !
Detokenized: Hello world!
It is usually the last step in the machine translation pipeline.
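Conceptually, a detokenizer is just a set of rules deciding, for each token, whether to glue it to the previous one. A naive sketch in Python (a toy illustration only, nowhere near the rule set detokenizer.perl applies, but enough for the example above):

```python
import re

def naive_detokenize(text):
    """Attach common closing punctuation to the preceding token.

    A toy example: real detokenizers also handle quotes, brackets,
    contractions and language-specific rules.
    """
    return re.sub(r" ([.,!?;:%])", r"\1", text)

print(naive_detokenize("Hello world !"))  # Hello world!
```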
It's a decent detokenizer written in Perl that requires no dependencies. It comes with the Moses toolkit and is licensed under the LGPL.
It's a mess of ifs and regexes, but it supports a wide range of languages, even when the language code is not explicitly mentioned in the code.
Languages mentioned in the code: English, French, Italian, Czech, Finnish, Chinese, Japanese, Korean.
It's a single-file Perl script.
You can grab it from the Moses GitHub repository:
wget https://raw.githubusercontent.com/moses-smt/mosesdecoder/master/scripts/tokenizer/detokenizer.perl
chmod +x detokenizer.perl
$ echo "Hello world !" | perl detokenizer.perl -q -l en
Hello world!
The -q flag stands for quiet and suppresses the script's version output to stderr.
Without -q, the output is:

$ echo "Hello world !" | perl detokenizer.perl -l en
Detokenizer Version $Revision: 4134 $
Language: en
Hello world!
If you use a language code that is not in the whitelist, you'll get a friendly warning about it:
$ echo "Hello world !" | perl detokenizer.perl -q -l lv
Warning: No built-in rules for language lv.
Hello world!
We are going to tokenize the English European Parliament corpus and then try to detokenize it.
First, download the corpus:
wget http://opus.lingfil.uu.se/Europarl/en-es.txt.zip
unzip -p en-es.txt.zip Europarl.en-es.en > europarl.en
rm en-es.txt.zip
wc -l europarl.en* ; du -h europarl.en*
It's about 2 million lines and 288 MB.
Then tokenize it (the sed command points tokenizer.perl at the current directory, where we just downloaded the non-breaking prefix file):

wget https://raw.githubusercontent.com/moses-smt/mosesdecoder/master/scripts/tokenizer/tokenizer.perl
wget https://raw.githubusercontent.com/moses-smt/mosesdecoder/master/scripts/share/nonbreaking_prefixes/nonbreaking_prefix.en
sed -i 's@my $mydir.*@my $mydir = ".";@' tokenizer.perl
perl tokenizer.perl -l en -q < europarl.en > europarl.en.tok
Let's measure how long it takes to detokenize the corpus:
$ perl detokenizer.perl -l en -q < europarl.en.tok | pv > europarl.en.tok.detok
 288MB 0:11:30 [ 427kB/s]
The speed on my virtual server was about 400-500 KB/s.
We can see the differences with this little Python 2.7 script:
import sys, itertools

with open(sys.argv[1]) as f1, open(sys.argv[2]) as f2:
    count, total = 0, 0
    for line1, line2 in itertools.izip(f1, f2):
        total += 1
        if line1 != line2:
            print line1, line2
            count += 1

print >> sys.stderr, count, total, 100.0*count/total
$ python2.7 diff.py europarl.en europarl.en.tok.detok
Please rise, then, for this minute' s silence.
Please rise, then, for this minute 's silence.
In this respect, I accept this proposal to amend Directive 94/55/EC which has been tabled for discussion today.
In this respect, I accept this proposal to amend Directive 94 / 55 / EC which has been tabled for discussion today.
117350 2007758 5.84482791253
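For reference, the same comparison works in Python 3 without itertools, since zip is already lazy there (the function below is my own sketch, not part of the original script):

```python
def count_diffs(path1, path2):
    """Count line pairs that differ between two parallel files."""
    count = total = 0
    with open(path1) as f1, open(path2) as f2:
        for line1, line2 in zip(f1, f2):  # zip is lazy in Python 3
            total += 1
            if line1 != line2:
                count += 1
    return count, total
```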
We can see that this detokenizer can't deal with possessives (denoted by 's) and specific identifiers (94/55/EC). In total, it detokenized 117,350 out of 2,007,758 lines incorrectly, an error rate of 5.84%.