peteris.rocks

Machine translation tools installation on Ubuntu Server

How to compile and install machine translation tools: moses, cdec, mgiza, fast align, kenlm


Here are copy & paste instructions for compiling and installing machine translation tools on Ubuntu Server 14.04 LTS. Each tool can be installed separately, and each section lists the dependencies that tool needs.

First, let's define two variables. $WORK is the directory where compilation takes place. $TOOLS is the destination directory where the tools are installed. I prefer to keep each tool in its own subdirectory of $TOOLS, and the instructions below reflect that.

export WORK=/tmp
export TOOLS=/opt

Make sure both directories exist and you have write permissions:

sudo mkdir -p $WORK $TOOLS
sudo chown $USER $WORK $TOOLS
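
Before starting a long build, it can help to confirm that both directories really are writable. The snippet below is a minimal sketch: check_dir is a hypothetical helper name, and the loop falls back to the values used above if the variables are unset.

```shell
# check_dir DIR: report whether DIR exists and is writable by the current user.
# Hypothetical helper for illustration only.
check_dir() {
  if [ -d "$1" ] && [ -w "$1" ]; then
    echo "ok: $1"
  else
    echo "missing or not writable: $1" >&2
    return 1
  fi
}

# Fall back to the values used in this article if the variables are unset.
for dir in "${WORK:-/tmp}" "${TOOLS:-/opt}"; do
  check_dir "$dir" || echo "fix with: sudo mkdir -p $dir && sudo chown \$USER $dir"
done
```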

Machine translation toolkits

Moses

Prerequisites. Moses needs a C++ compiler and the Boost libraries.

sudo apt-get -qq install git build-essential automake libtool \
  libboost-all-dev zlib1g-dev libbz2-dev liblzma-dev \
  libgoogle-perftools-dev python-dev \
  pigz

If you plan to use EMS (Experiment Management System), also install graphviz and imagemagick, which it uses to generate graphs.

sudo apt-get -qq install graphviz imagemagick

CMPH (C Minimal Perfect Hashing Library) is needed for phrase table binarization.

cd $WORK
wget http://downloads.sourceforge.net/project/cmph/cmph/cmph-2.0.tar.gz
tar xf cmph-*.tar.gz
cd cmph-*
./configure --prefix=$PWD/build
make
make install
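
The commands above unpack into whatever directory matches cmph-*, while the Moses compile step further down expects the prefix at $WORK/cmph-2.0/build. The sketch below locates the installed prefix so the two can be compared. find_cmph_prefix is a hypothetical helper, and it assumes make install places cmph.h under include/ in the prefix.

```shell
# find_cmph_prefix WORK_DIR: print the first cmph-*/build directory under
# WORK_DIR that contains the installed header. Hypothetical helper.
# Assumption: "make install" puts cmph.h into <prefix>/include.
find_cmph_prefix() {
  for prefix in "$1"/cmph-*/build; do
    if [ -f "$prefix/include/cmph.h" ]; then
      echo "$prefix"
      return 0
    fi
  done
  return 1
}

if CMPH_PREFIX=$(find_cmph_prefix "${WORK:-/tmp}"); then
  echo "CMPH installed in $CMPH_PREFIX"
else
  echo "CMPH not found under ${WORK:-/tmp}; rerun make install" >&2
fi
```

If the printed path differs from $WORK/cmph-2.0/build, pass the printed path to --with-cmph instead.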

Install a Perl module that is used by one of the BLEU measurement scripts.

sudo PERL_MM_USE_DEFAULT=1 cpan XML::Twig > /dev/null

Get moses from GitHub:

cd $WORK
git clone --depth=1 https://github.com/moses-smt/mosesdecoder
cd mosesdecoder

Compile moses:

./bjam -a --static -j`nproc` --with-mm --with-cmph=$WORK/cmph-2.0/build

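Here are the parameters:

-a rebuilds all targets, even those that look up to date
--static links the binaries statically
-j`nproc` runs as many parallel build jobs as there are CPU cores
--with-mm enables the memory-mapped suffix array phrase table
--with-cmph points to the CMPH installation built above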

Copy all binaries and the scripts directory:

mkdir -p $TOOLS/moses
find bin -maxdepth 1 -type f -executable -exec cp {} $TOOLS/moses \;
cp -r scripts $TOOLS/moses

cdec

Prerequisites.

sudo apt-get -qq install git build-essential cmake flex libboost-all-dev libeigen3-dev libbz2-dev liblzma-dev

Get cdec from GitHub:

cd $WORK
git clone --depth=1 https://github.com/redpony/cdec
cd cdec

Compile it:

mkdir build
cd build
cmake ..
make -j`nproc`

Copy what you need:

mkdir -p $TOOLS/cdec
find . -type f -executable | grep -v CMakeFiles | grep -v '\.so' | xargs -I{} cp {} $TOOLS/cdec
cp -r ../corpus $TOOLS/cdec
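
The copy step above selects files by filtering path names with grep. The same selection can be expressed with find predicates alone, which makes it harder to exclude a file by accident. This is a sketch rather than a drop-in replacement: copy_executables is a hypothetical helper, and it assumes GNU find (for -executable), as the rest of this article does.

```shell
# copy_executables SRC DEST: copy executable regular files from SRC into DEST,
# skipping CMake's scratch directories and shared libraries (*.so, *.so.N).
# Hypothetical helper for illustration.
copy_executables() {
  mkdir -p "$2"
  find "$1" -type f -executable \
    ! -path '*/CMakeFiles/*' \
    ! -name '*.so' ! -name '*.so.*' \
    -exec cp {} "$2" \;
}

# Usage from inside the cdec build directory:
#   copy_executables . "$TOOLS/cdec"
```

The destination should live outside the source tree so that find does not revisit the copies.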

Word alignment

Multi-threaded GIZA++ (mgiza)

Prerequisites.

sudo apt-get -qq install git build-essential cmake libboost-all-dev

Get mgiza from GitHub:

cd $WORK
git clone --depth=1 https://github.com/moses-smt/mgiza.git
cd mgiza/mgizapp

Compile it:

mkdir build
cd build
cmake ..
make -j`nproc`

Copy the binaries:

mkdir -p $TOOLS/mgiza
cp bin/* $TOOLS/mgiza

fast align

Prerequisites.

sudo apt-get install -qq git build-essential cmake libgoogle-perftools-dev libsparsehash-dev

Get fast align from GitHub:

cd $WORK
git clone --depth=1 https://github.com/clab/fast_align
cd fast_align

Compile it statically:

echo 'SET(CMAKE_EXE_LINKER_FLAGS "-static")' >> CMakeLists.txt
mkdir build
cd build
cmake ..
make -j`nproc`

Copy fast_align and also atools which can be used for alignment symmetrization:

mkdir -p $TOOLS/fast_align
cp atools fast_align $TOOLS/fast_align
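
To check that both binaries work, the sketch below prepares input in the "source ||| target" format that fast_align reads, then runs forward alignment, reverse alignment, and symmetrization with the flags shown in the fast_align README. The helper names and the file names are hypothetical.

```shell
# make_parallel SRC_FILE TGT_FILE: join two line-aligned files into the
# "source ||| target" format that fast_align reads. Hypothetical helper.
make_parallel() {
  paste "$1" "$2" | awk -F '\t' '{ print $1 " ||| " $2 }'
}

# align_corpus SRC_FILE TGT_FILE OUT_PREFIX: forward and reverse alignment,
# then symmetrization with atools (flags as in the fast_align README).
align_corpus() {
  bin="${TOOLS:-/opt}/fast_align"
  make_parallel "$1" "$2" > "$3.parallel"
  "$bin/fast_align" -i "$3.parallel" -d -o -v > "$3.forward.align"
  "$bin/fast_align" -i "$3.parallel" -d -o -v -r > "$3.reverse.align"
  "$bin/atools" -i "$3.forward.align" -j "$3.reverse.align" \
    -c grow-diag-final-and > "$3.sym.align"
}

# Usage (corpus.fr and corpus.en are placeholder file names):
#   align_corpus corpus.fr corpus.en /tmp/corpus
```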

Note that if you need incremental alignment, you should use the version of fast_align that is bundled with cdec instead.

Language models

KenLM

Prerequisites.

sudo apt-get -qq install git build-essential libboost-all-dev

Get KenLM from GitHub:

cd $WORK
git clone --depth=1 https://github.com/kpu/kenlm
cd kenlm

Compile it:

./bjam -a --static -j`nproc`

Copy the binaries:

mkdir -p $TOOLS/kenlm
find bin -maxdepth 1 -type f -executable -exec cp {} $TOOLS/kenlm \;
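
Because every tool lives in its own subdirectory of $TOOLS, none of them is on your PATH yet. The sketch below adds each subdirectory and then reports which of the main binaries can be found. add_tools_to_path is a hypothetical helper, and the binary names in the loop are assumptions about what each build produces.

```shell
# add_tools_to_path TOOLS_DIR: prepend every immediate subdirectory of
# TOOLS_DIR to PATH, skipping ones already present. Hypothetical helper.
add_tools_to_path() {
  for dir in "$1"/*/; do
    dir=${dir%/}
    [ -d "$dir" ] || continue
    case ":$PATH:" in
      *":$dir:"*) ;;               # already on PATH
      *) PATH="$dir:$PATH" ;;
    esac
  done
  export PATH
}

add_tools_to_path "${TOOLS:-/opt}"

# Binary names below are assumptions based on what each build produces.
for tool in moses cdec mgiza fast_align lmplz; do
  if command -v "$tool" > /dev/null; then
    echo "found:   $(command -v "$tool")"
  else
    echo "missing: $tool"
  fi
done
```

Note that this adds every subdirectory of $TOOLS, so if /opt holds unrelated software you may prefer to list the tool directories explicitly.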

Tips and tricks

Strip binaries

If you do not plan on debugging moses and other tools, you can strip debugging and other unneeded symbols.

It can free up a lot of space. For instance, du -sh $TOOLS showed 1.2G before and 279M after stripping the binaries.

(find $TOOLS -type f -executable | xargs strip -s &> /dev/null) || true
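
The one-liner above also passes scripts to strip and simply hides the resulting errors. If you would rather strip only real binaries and see genuine failures, the sketch below checks each file for the ELF magic bytes first. is_elf and strip_elf are hypothetical helper names.

```shell
# is_elf FILE: succeed if FILE starts with the ELF magic bytes 7f 45 4c 46.
is_elf() {
  [ "$(head -c 4 "$1" 2>/dev/null | od -An -tx1 | tr -d ' \n')" = "7f454c46" ]
}

# strip_elf DIR: strip only the executable files under DIR that are ELF
# binaries, leaving shell, Perl, and Python scripts untouched.
# Hypothetical helper for illustration.
strip_elf() {
  find "$1" -type f -executable | while IFS= read -r f; do
    if is_elf "$f"; then
      strip -s "$f" || echo "could not strip: $f" >&2
    fi
  done
}

# Usage:
#   du -sh "$TOOLS"; strip_elf "$TOOLS"; du -sh "$TOOLS"
```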

Vagrantfile

Here is a Vagrantfile for you.

Vagrant.configure("2") do |config|
  # Ubuntu 14.04 LTS x64 official cloud image
  config.vm.box = "ubuntu/trusty64"
  config.vm.box_check_update = false

  # VirtualBox
  config.vm.provider "virtualbox" do |vb|
    vb.name = "Machine Translation" # friendly name that shows up in Oracle VM VirtualBox Manager
    vb.memory = 4096 # memory in megabytes
    vb.cpus = 4 # cpu cores, can't be more than the host actually has!
    vb.customize ["modifyvm", :id, "--natdnshostresolver1", "on"] # fixes slow dns lookups
  end

  # use local ubuntu mirror
  config.vm.provision :shell, inline: "sed -i 's/archive.ubuntu.com/lv.archive.ubuntu.com/g' /etc/apt/sources.list"
  # add swap
  config.vm.provision :shell, inline: "fallocate -l 4G /swapfile && chmod 0600 /swapfile && mkswap /swapfile && swapon /swapfile && echo '/swapfile none swap sw 0 0' >> /etc/fstab"
  # refresh package sources
  config.vm.provision :shell, inline: "apt-get update"

  # enable logging in via ssh with a password
  config.ssh.username = "vagrant"
  config.ssh.password = "vagrant"
end
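
The swap provisioner in this Vagrantfile assumes a fresh machine. If you expect to re-run provisioning, a guarded version is safer. The following is a sketch only: ensure_line and setup_swap are hypothetical helpers that you could place in a provisioning script, and ensure_line takes the file as an argument so it can be tried on a scratch file first.

```shell
# ensure_line FILE LINE: append LINE to FILE unless an identical line exists.
# Hypothetical helper for illustration.
ensure_line() {
  grep -qxF -- "$2" "$1" 2>/dev/null || printf '%s\n' "$2" >> "$1"
}

# setup_swap SWAPFILE SIZE: create and enable a swap file once (run as root),
# then make sure it is listed in /etc/fstab exactly once.
setup_swap() {
  if [ ! -f "$1" ]; then
    fallocate -l "$2" "$1" && chmod 0600 "$1" && mkswap "$1" && swapon "$1" \
      || return 1
  fi
  ensure_line /etc/fstab "$1 none swap sw 0 0"
}

# Usage inside the VM, as root:
#   setup_swap /swapfile 4G
```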

Then just do

vagrant up
vagrant ssh

and copy & paste away.