Here are copy & paste instructions for compiling and installing machine translation tools on Ubuntu Server 14.04 LTS. Each tool can be installed separately, and the dependencies are listed for each tool as well.
First, let's define two variables. $WORK is the directory where compilation will take place. $TOOLS is the destination directory where the tools will be installed. I prefer to have them in their own directory and the instructions below reflect that.
export WORK=/tmp
export TOOLS=/opt
Make sure both directories exist and you have write permissions:
sudo mkdir -p $WORK $TOOLS
sudo chown $USER $WORK $TOOLS
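Before starting a long build, it can be worth confirming that both directories really are writable. The following is only a minimal sketch; `check_dir` is a hypothetical helper name, not part of any of the tools installed here.

```shell
# check_dir is a hypothetical helper: it succeeds only if the given
# path is an existing directory that the current user can write to.
check_dir() {
    if [ -d "$1" ] && [ -w "$1" ]; then
        echo "ok: $1"
    else
        echo "not writable: $1" >&2
        return 1
    fi
}

# Usage (after exporting WORK and TOOLS as above):
#   check_dir "$WORK" && check_dir "$TOOLS"
```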
Machine translation toolkits
Moses
Prerequisites. Moses needs a C++ compiler and the Boost libraries.
sudo apt-get -qq install git build-essential automake libtool \
libboost-all-dev zlib1g-dev libbz2-dev liblzma-dev \
libgoogle-perftools-dev python-dev \
pigz
If you plan on using EMS (Experiment Management System), you can also install graphviz and imagemagick, which are used for generating graphs.
sudo apt-get -qq install graphviz imagemagick
CMPH (C Minimal Perfect Hashing Library) is needed for phrase table binarization.
cd $WORK
wget http://downloads.sourceforge.net/project/cmph/cmph/cmph-2.0.tar.gz
tar xf cmph-*.tar.gz
cd cmph-*
./configure --prefix=$PWD/build
make
make install
Install a Perl module that is used by one of the BLEU measurement scripts.
sudo PERL_MM_USE_DEFAULT=1 cpan XML::Twig > /dev/null
Get moses from GitHub:
cd $WORK
git clone --depth=1 https://github.com/moses-smt/mosesdecoder
cd mosesdecoder
Compile moses:
./bjam -a --static -j`nproc` --with-mm --with-cmph=$WORK/cmph-2.0/build
Here are the parameters:
-a: make sure to recompile everything
--static: build static binaries
-j: number of parallel jobs (nproc returns the number of cores)
--with-mm: memory mapped suffix array phrase tables
--with-cmph: with CMPH for phrase table binarization
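nproc is part of GNU coreutils, so it is there on Ubuntu, but if you adapt these instructions to another system the job count can be computed with a fallback. A small sketch; the `JOBS` variable name is an assumption made here and not something bjam itself looks for.

```shell
# Number of parallel jobs: use nproc when available, otherwise ask
# getconf, otherwise fall back to a single job.
JOBS=$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
echo "building with $JOBS jobs"

# Then pass it to the build, for example:
#   ./bjam -a --static -j"$JOBS" --with-mm --with-cmph=$WORK/cmph-2.0/build
```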
Copy all binaries and the scripts directory:
mkdir -p $TOOLS/moses
find bin -maxdepth 1 -type f -executable -exec cp {} $TOOLS/moses \;
cp -r scripts $TOOLS/moses
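The same find command is used again for KenLM further down, so it could be wrapped in a small function. This is just a sketch; `copy_executables` is a hypothetical name, and it relies on the -executable test of GNU find, which is what Ubuntu ships.

```shell
# copy_executables SRC DEST
# Hypothetical helper: copies the executable regular files found directly
# in SRC (no subdirectories) into DEST, creating DEST if needed.
copy_executables() {
    mkdir -p "$2"
    find "$1" -maxdepth 1 -type f -executable -exec cp {} "$2" \;
}

# Usage, from inside the source directory:
#   copy_executables bin "$TOOLS/moses"
```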
cdec
Prerequisites.
sudo apt-get -qq install git build-essential cmake flex libboost-all-dev libeigen3-dev libbz2-dev liblzma-dev
Get cdec from GitHub:
cd $WORK
git clone --depth=1 https://github.com/redpony/cdec
cd cdec
Compile it:
mkdir build
cd build
cmake ..
make -j`nproc`
Copy what you need:
mkdir -p $TOOLS/cdec
find . -type f -executable | grep -v CMakeFiles | grep -v '\.so' | xargs -I{} cp {} $TOOLS/cdec
cp -r ../corpus $TOOLS/cdec
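The grep filters in the find pipeline above can also be expressed with find's own predicates, which avoids matching on arbitrary parts of the path. A sketch under my own assumptions: `list_cdec_binaries` is a hypothetical helper, and it only treats files named *.so or *.so.N as shared libraries, which is slightly narrower than the grep version.

```shell
# list_cdec_binaries BUILD_DIR
# Hypothetical helper: prints executable files under BUILD_DIR, skipping
# CMake's internal CMakeFiles directories and shared libraries.
list_cdec_binaries() {
    find "$1" -type d -name CMakeFiles -prune -o \
        -type f -executable ! -name '*.so' ! -name '*.so.*' -print
}

# Usage, from inside the build directory:
#   list_cdec_binaries . | xargs -I{} cp {} "$TOOLS/cdec"
```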
Word alignment
Multi-threaded GIZA++ (mgiza)
Prerequisites.
sudo apt-get -qq install git build-essential cmake libboost-all-dev
Get mgiza from GitHub:
cd $WORK
git clone --depth=1 https://github.com/moses-smt/mgiza.git
cd mgiza/mgizapp
Compile it:
mkdir build
cd build
cmake ..
make -j`nproc`
Copy the binaries:
mkdir -p $TOOLS/mgiza
cp bin/* $TOOLS/mgiza
fast_align
Prerequisites.
sudo apt-get install -qq git build-essential cmake libgoogle-perftools-dev libsparsehash-dev
Get fast_align from GitHub:
cd $WORK
git clone --depth=1 https://github.com/clab/fast_align
cd fast_align
Compile it statically:
echo 'SET(CMAKE_EXE_LINKER_FLAGS "-static")' >> CMakeLists.txt
mkdir build
cd build
cmake ..
make -j`nproc`
Copy fast_align and also atools, which can be used for alignment symmetrization:
mkdir -p $TOOLS/fast_align
cp atools fast_align $TOOLS/fast_align
Note that if you want incremental fast align, you should use the fast_align that is bundled with cdec.
Language models
KenLM
Prerequisites.
sudo apt-get -qq install git build-essential libboost-all-dev
Get KenLM from GitHub:
cd $WORK
git clone --depth=1 https://github.com/kpu/kenlm
cd kenlm
Compile it:
./bjam -a --static -j`nproc`
Copy the binaries:
mkdir -p $TOOLS/kenlm
find bin -maxdepth 1 -type f -executable -exec cp {} $TOOLS/kenlm \;
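After copying, a quick check that the binaries you care about actually landed in the destination can save some confusion later. A minimal sketch; `require_binaries` is a hypothetical helper, and the names in the usage comment (lmplz and build_binary, the usual KenLM estimation and binarization programs) are just examples of what you might check for.

```shell
# require_binaries DIR NAME...
# Hypothetical helper: reports every NAME that is not an executable file
# in DIR and returns non-zero if any of them is missing.
require_binaries() {
    dir=$1
    shift
    missing=0
    for name in "$@"; do
        if [ ! -x "$dir/$name" ]; then
            echo "missing: $dir/$name" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage:
#   require_binaries "$TOOLS/kenlm" lmplz build_binary
```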
Tips and tricks
Strip binaries
If you do not plan on debugging moses and other tools, you can strip debugging and other unneeded symbols.
It can free up a lot of space. For instance, du -sh $TOOLS showed 1.2G before and 279M after stripping the binaries.
(find $TOOLS -type f -executable | xargs strip -s &> /dev/null) || true
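The command above hides errors because $TOOLS also contains executable scripts (for example under moses/scripts) that strip cannot process. If you would rather not silence everything, one option is to pick out the real ELF binaries first by looking at their first four bytes. This is only a sketch; `is_elf` and `list_elf_executables` are hypothetical helper names.

```shell
# is_elf FILE
# Succeeds if FILE starts with the ELF magic bytes 7f 45 4c 46.
is_elf() {
    [ "$(head -c 4 "$1" 2>/dev/null | od -An -tx1 | tr -d ' \n')" = "7f454c46" ]
}

# list_elf_executables DIR
# Hypothetical helper: prints executable files under DIR that are ELF binaries.
list_elf_executables() {
    find "$1" -type f -executable | while IFS= read -r f; do
        if is_elf "$f"; then
            echo "$f"
        fi
    done
}

# Usage:
#   list_elf_executables "$TOOLS" | xargs -r strip -s
```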
Vagrantfile
Here is a Vagrantfile for you.
Vagrant.configure("2") do |config|
  # Ubuntu 14.04 LTS x64 official cloud image
  config.vm.box = "ubuntu/trusty64"
  config.vm.box_check_update = false

  # VirtualBox
  config.vm.provider "virtualbox" do |vb|
    vb.name = "Machine Translation" # friendly name that shows up in Oracle VM VirtualBox Manager
    vb.memory = 4096 # memory in megabytes
    vb.cpus = 4 # cpu cores, can't be more than the host actually has!
    vb.customize ["modifyvm", :id, "--natdnshostresolver1", "on"] # fixes slow dns lookups
  end

  # use local ubuntu mirror
  config.vm.provision :shell, inline: "sed -i 's/archive.ubuntu.com/lv.archive.ubuntu.com/g' /etc/apt/sources.list"

  # add swap
  config.vm.provision :shell, inline: "fallocate -l 4G /swapfile && chmod 0600 /swapfile && mkswap /swapfile && swapon /swapfile && echo '/swapfile none swap sw 0 0' >> /etc/fstab"

  # refresh package sources
  config.vm.provision :shell, inline: "apt-get update"

  # enable logging in via ssh with a password
  config.ssh.username = "vagrant"
  config.ssh.password = "vagrant"
end
Then just do
vagrant up
vagrant ssh
and copy & paste away.