Detokenization is the inverse of tokenization: removal of spaces between tokens when required. For instance, there should not be a space between a word and a period (both are separate tokens) at the end of a sentence.
Tokenized:   Hello world !
Detokenized: Hello world!
It is usually the last step in the machine translation pipeline.
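Conceptually, a detokenizer is just a set of rules deciding, for each token, whether to glue it to the previous one. A naive sketch in Python (a toy illustration only, nowhere near the rule set detokenizer.perl applies, but enough for the example above):

```python
import re

def naive_detokenize(text):
    """Attach common closing punctuation to the preceding token.

    A toy example: real detokenizers also handle quotes, brackets,
    contractions and language-specific rules.
    """
    return re.sub(r" ([.,!?;:%])", r"\1", text)

print(naive_detokenize("Hello world !"))  # Hello world!
```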
It's a decent detokenizer written in Perl that requires no dependencies. It comes with the Moses toolkit and is licensed under the LGPL.
It's a mess of ifs and regexes, but it supports a wide range of languages, even when the language code is not explicitly mentioned in the code.
Languages mentioned in the code: English, French, Italian, Czech, Finnish, Chinese, Japanese, Korean.
It's a single-file Perl script.
You can grab it from the Moses GitHub repository:
wget https://raw.githubusercontent.com/moses-smt/mosesdecoder/master/scripts/tokenizer/detokenizer.perl
chmod +x detokenizer.perl
$ echo "Hello world !" | perl detokenizer.perl -q -l en
Hello world!
The -q flag stands for quiet and suppresses the script's version output to stderr.
Without -q, the output is:

$ echo "Hello world !" | perl detokenizer.perl -l en
Detokenizer Version $Revision: 4134 $
Language: en
Hello world!
If you use a language code that is not in the whitelist, you'll get a friendly warning about it:
$ echo "Hello world !" | perl detokenizer.perl -q -l lv
Warning: No built-in rules for language lv.
Hello world!
We are going to tokenize the English European Parliament corpus and then try to detokenize it.
First, download the corpus:
wget http://opus.lingfil.uu.se/Europarl/en-es.txt.zip
unzip -p en-es.txt.zip Europarl.en-es.en > europarl.en
rm en-es.txt.zip
wc -l europarl.en* ; du -h europarl.en*
It's about 2 million lines and 288 MB.
Then tokenize it (the sed command points tokenizer.perl at the current directory, where we just downloaded the non-breaking prefix file):

wget https://raw.githubusercontent.com/moses-smt/mosesdecoder/master/scripts/tokenizer/tokenizer.perl
wget https://raw.githubusercontent.com/moses-smt/mosesdecoder/master/scripts/share/nonbreaking_prefixes/nonbreaking_prefix.en
sed -i 's@my $mydir.*@my $mydir = ".";@' tokenizer.perl
perl tokenizer.perl -l en -q < europarl.en > europarl.en.tok
Let's measure how long it takes to detokenize the corpus:
$ perl detokenizer.perl -l en -q < europarl.en.tok | pv > europarl.en.tok.detok
 288MB 0:11:30 [ 427kB/s]
The speed on my virtual server was about 400-500 KB/s.
We can see the differences with this little Python 2.7 script:
import sys, itertools

with open(sys.argv[1]) as f1, open(sys.argv[2]) as f2:
    count, total = 0, 0
    for line1, line2 in itertools.izip(f1, f2):
        total += 1
        if line1 != line2:
            print line1, line2
            count += 1

print >> sys.stderr, count, total, 100.0*count/total
$ python2.7 diff.py europarl.en europarl.en.tok.detok
Please rise, then, for this minute' s silence.
Please rise, then, for this minute 's silence.
In this respect, I accept this proposal to amend Directive 94/55/EC which has been tabled for discussion today.
In this respect, I accept this proposal to amend Directive 94 / 55 / EC which has been tabled for discussion today.
117350 2007758 5.84482791253
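For reference, the same comparison works in Python 3 without itertools, since zip is already lazy there (the function below is my own sketch, not part of the original script):

```python
def count_diffs(path1, path2):
    """Count line pairs that differ between two parallel files."""
    count = total = 0
    with open(path1) as f1, open(path2) as f2:
        for line1, line2 in zip(f1, f2):  # zip is lazy in Python 3
            total += 1
            if line1 != line2:
                count += 1
    return count, total
```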
We can see that this detokenizer can't deal with possessives (denoted by 's) and specific identifiers (94/55/EC). In total, it detokenized 117,350 out of 2,007,758 lines incorrectly, an error rate of 5.84%.