Latvian Part-of-Speech tagging

This is an overview of available PoS taggers for Latvian.

There's also a step-by-step tutorial on how to use one to tag an untokenized piece of text.

LU MII Tagger

The Institute of Mathematics and Computer Science at the University of Latvia (LU MII in Latvian) has a part-of-speech tagger based on the Stanford POS Tagger.

The tagger is described in this paper, which reports an accuracy of 97.9% for PoS tags and 93.6% for full morphological tags (case, gender, number, tense, etc.).

Installation

It's written in Java and you can just download and run a .jar file.

Here's how to install Java (OpenJDK 8) on Ubuntu 16.04:

sudo apt-get update
sudo apt-get install -y openjdk-8-jre-headless

Download the tagger:

wget "https://oss.sonatype.org/service/local/artifact/maven/redirect?r=releases&g=lv.ailab.morphology&a=tagger&v=1.0.0&e=jar&c=jar-with-dependencies" \
  -O tagger-1.0.0-jar-with-dependencies.jar

Use

Run the following command:

$ echo "KNAB lūdz apsūdzēt policijas amatpersonu par €300 kukuļņemšanu." | java -jar tagger-1.0.0-jar-with-dependencies.jar
1       KNAB    knab    y       y       Vārdšķira=Saīsinājums|Lielo_burtu_lietojums=Rakstīts_ar_lielajiem_burtiem|LETA_lemma=knab|Avota_pamatforma=knab|Pamatforma=knab
2       lūdz    lūgt    v__m___2s__     vmnm0t12san     Laiks=Nepiemīt|Skaitlis=Vienskaitlis|Persona=2|Darbības_vārda_tips=Patstāvīgs_darbības_vārds|Atgriezeniskums=Nē|Avota_pamatforma=lūgt|Pamatforma=lūgt|Noliegums=Nē|Transitivitāte=Pārejošs|Vārdšķira=Darbības_vārds|LETA_lemma=lūgt|Izteiksme=Pavēles|Kārta=Darāmā
3       apsūdzēt        apsūdzēt        v__n___00__     vmnn0t3000n     Laiks=Nepiemīt|Skaitlis=Nepiemīt|Persona=Nepiemīt|Darbības_vārda_tips=Patstāvīgs_darbības_vārds|Atgriezeniskums=Nē|Avota_pamatforma=apsūdzēt|Pamatforma=apsūdzēt|Noliegums=Nē|Transitivitāte=Pārejošs|Vārdšķira=Darbības_vārds|LETA_lemma=apsūdzēt|Izteiksme=Nenoteiksme|Kārta=Nepiemīt
4       policijas       policija        n_fsg_  ncfsg4  Skaitlis=Vienskaitlis|Avota_pamatforma=policija|Lietvārda_tips=Sugas_vārds|Pamatforma=policija|Vārdšķira=Lietvārds|Locījums=Ģenitīvs|Dzimte=Sieviešu|LETA_lemma=policija
5       amatpersonu     amatpersona     n_fsa_  ncfsa4  Skaitlis=Vienskaitlis|Avota_pamatforma=amatpersona|Lietvārda_tips=Sugas_vārds|Pamatforma=amatpersona|Vārdšķira=Lietvārds|Locījums=Akuzatīvs|Dzimte=Sieviešu|LETA_lemma=amatpersona
6       par     par     s____   sppdn   Skaitlis=Daudzskaitlis|Vietas_apstākļa_nozīme=Nē|Avota_pamatforma=par|Pamatforma=par|Vārdšķira=Prievārds|LETA_lemma=par|Novietojums=Pirms|Rekcija=Datīvs
7       €       €       xx      xx      Vārdšķira=Reziduālis|LETA_lemma=_rare_|Avota_pamatforma=€|Pamatforma=€
8       300     300     xn      xn      Reziduāļa_tips=Skaitlis_cipariem|Vārdšķira=Reziduālis|LETA_lemma=300|Pamatforma=300
9       kukuļņemšanu    kukuļņemšana    n_fsa_  ncfsa4  Skaitlis=Vienskaitlis|Avota_pamatforma=kukuļņemšana|Lietvārda_tips=Sugas_vārds|Pamatforma=kukuļņemšana|Vārdšķira=Lietvārds|Locījums=Akuzatīvs|Dzimte=Sieviešu|LETA_lemma=_rare_
10      .       .       zs      zs      Pieturzīmes_tips=Punkts|Vārdšķira=Pieturzīme|LETA_lemma=.|Avota_pamatforma=.|Pamatforma=.

As you can see, it reads plain text from standard input and performs tokenization, morphological analysis and disambiguation for you.

Each output line has six columns: the token number, the original token, the lemma, the reduced morphological tag, the full morphological tag and a list of attributes (key=value pairs separated by |).
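If you want to process this output in a script, each line can be split into its columns. Here is a minimal Python sketch, which is not part of the tagger. It assumes the columns are separated by whitespace (presumably tabs in the real output) and that none of the first five columns contains whitespace. The names `Token` and `parse_line` are made up for illustration, and the sample line is a shortened copy of line 4 from the output above.

```python
from typing import Dict, NamedTuple


class Token(NamedTuple):
    index: int
    form: str
    lemma: str
    reduced_tag: str
    full_tag: str
    attributes: Dict[str, str]


def parse_line(line: str) -> Token:
    """Split one line of tagger output into its columns."""
    # Assumption: the first five columns never contain whitespace,
    # so splitting on any run of whitespace is safe.
    fields = line.split(None, 5)
    if len(fields) < 5:
        raise ValueError(f"expected at least 5 columns, got {len(fields)}")
    index, form, lemma, reduced_tag, full_tag = fields[:5]
    attributes = {}
    if len(fields) == 6:
        # Attributes look like key=value pairs joined by "|".
        for pair in fields[5].strip().split("|"):
            key, _, value = pair.partition("=")
            attributes[key] = value
    return Token(int(index), form, lemma, reduced_tag, full_tag, attributes)


# Shortened version of line 4 from the example output.
sample = (
    "4\tpolicijas\tpolicija\tn_fsg_\tncfsg4\t"
    "Skaitlis=Vienskaitlis|Vārdšķira=Lietvārds|Locījums=Ģenitīvs|Dzimte=Sieviešu"
)
token = parse_line(sample)
print(token.lemma, token.full_tag, token.attributes["Locījums"])
```

Splitting on whitespace rather than on a literal tab keeps the sketch working even if the output has been copied from a web page and the tabs turned into spaces.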

You can tag a plain text file like this:

java -jar tagger-1.0.0-jar-with-dependencies.jar < input.txt > output.txt
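The same command can be called from a script. Below is a rough Python sketch that pipes a string to the tagger and returns whatever it prints. It assumes `java` is on your PATH, that the .jar file sits in the current directory, and that the JVM reads and writes UTF-8 (which depends on your system). The function names `tagger_command` and `tag_text` are hypothetical.

```python
import subprocess
from typing import List


def tagger_command(jar_path: str) -> List[str]:
    """Build the same command line as in the shell example."""
    return ["java", "-jar", jar_path]


def tag_text(text: str, jar_path: str = "tagger-1.0.0-jar-with-dependencies.jar") -> str:
    """Send text to the tagger on standard input and return its standard output."""
    result = subprocess.run(
        tagger_command(jar_path),
        input=text,
        capture_output=True,
        text=True,
        encoding="utf-8",  # assumption: the JVM uses UTF-8 for stdin and stdout
        check=True,        # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout


# Usage (needs Java and the downloaded .jar file):
# print(tag_text("KNAB lūdz apsūdzēt policijas amatpersonu."))
```

Each call starts a new JVM, so for a lot of text it is probably better to pass everything in one call than to tag sentence by sentence.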

Tag Set

The tag set is based on MULTEXT-East and you can see a (probably outdated) table of all possible tags here. The most up-to-date version of the tag set is available in an XML format on GitHub.

Here is a short description of tags from the example above.

lūdz lūgt v__m___2s__: lūgt - lemma, v - verb, m - imperative mood (incorrect: in this sentence it's indicative), 2 - 2nd person (incorrect: KNAB is the subject, so it's 3rd person), s - singular
policijas policija n_fsg_: policija - lemma, n - noun, f - feminine, s - singular, g - genitive case
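Judging by the sample, the reduced tag is positional: each character slot holds one attribute and `_` marks a slot that has been left out. Here is a rough Python sketch that decodes just the two tags above. The tables contain only the positions and letters that appear in those two examples, so treat them as an illustration and not as the real tag set. The names `POSITIONS`, `VALUES` and `describe_tag` are hypothetical.

```python
# Slot meanings per part of speech, limited to what the two examples show.
POSITIONS = {
    "v": {3: "mood", 7: "person", 8: "number"},
    "n": {2: "gender", 3: "number", 4: "case"},
}

# Letter meanings per attribute, again limited to the two examples.
VALUES = {
    "mood": {"m": "imperative"},
    "person": {"2": "2nd"},
    "number": {"s": "singular"},
    "gender": {"f": "feminine"},
    "case": {"g": "genitive"},
}

PARTS_OF_SPEECH = {"v": "verb", "n": "noun"}


def describe_tag(tag: str) -> dict:
    """Turn a reduced positional tag into a dictionary of attributes."""
    if not tag:
        raise ValueError("empty tag")
    pos = tag[0]
    description = {"pos": PARTS_OF_SPEECH.get(pos, pos)}
    for i, name in POSITIONS.get(pos, {}).items():
        # Skip slots that are missing or marked as left out.
        if i < len(tag) and tag[i] != "_":
            description[name] = VALUES[name].get(tag[i], tag[i])
    return description


print(describe_tag("v__m___2s__"))
print(describe_tag("n_fsg_"))
```

Unknown letters are passed through unchanged, so the sketch does not pretend to know values it has not seen.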

TensorFlow Tagger

Pēteris Paikens (not me, we just share a first name), who also developed the LU MII tagger, has done some research on using deep neural networks for PoS and morphological tagging of Latvian.

His research was published at Baltic HLT 2016 (I built the first version of the website, by the way) and it achieved the best results so far. It's built with TensorFlow. What's interesting is that it doesn't require a morphological analyzer at all.

The source code is available on GitHub and you can read the paper in the conference proceedings (page 160, near the end).

While it has the highest accuracy, it's not the easiest tool to use (you have to set up TensorFlow to run it) and it is slower than the other tools.

[Figure: TensorFlow tagger accuracy comparison]

In the figure:

Paikens-2012 is the LU MII tagger described above
Ņikiforovs-2015 is my perceptron tagger described below

Perceptron Tagger

I developed a part-of-speech tagger for Latvian based on the multiclass averaged perceptron machine learning algorithm in 2013/2014.

I did my Bachelor's thesis (in Latvian) on it in 2015.

It has slightly better accuracy than the LU MII tagger and it's much, much faster because it's so simple. By fast I mean it can tag 400,000 tokens per second on an i7.
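To give a sense of how simple the algorithm is, here is a generic Python sketch of a multiclass averaged perceptron. This is a textbook-style version and not the actual tagger code. The class name, the feature strings and the toy training data are all made up, and a real tagger would use far richer features.

```python
from collections import defaultdict


class AveragedPerceptron:
    def __init__(self, classes):
        self.classes = sorted(classes)
        # feature -> class -> weight
        self.weights = defaultdict(lambda: defaultdict(float))
        self._totals = defaultdict(float)  # (feature, class) -> accumulated weight
        self._stamps = defaultdict(int)    # (feature, class) -> step of last change
        self._step = 0

    def predict(self, features):
        scores = {c: 0.0 for c in self.classes}
        for feat in features:
            if feat in self.weights:
                for cls, weight in self.weights[feat].items():
                    scores[cls] += weight
        # Highest score wins; ties are broken by class name.
        return max(self.classes, key=lambda c: (scores[c], c))

    def update(self, truth, guess, features):
        self._step += 1
        if truth == guess:
            return
        # Reward the correct class and penalize the wrong guess.
        for feat in features:
            self._adjust(feat, truth, 1.0)
            self._adjust(feat, guess, -1.0)

    def _adjust(self, feat, cls, delta):
        key = (feat, cls)
        # Add up the old weight for every step it was in effect.
        self._totals[key] += (self._step - self._stamps[key]) * self.weights[feat][cls]
        self._stamps[key] = self._step
        self.weights[feat][cls] += delta

    def average(self):
        """Replace each weight with its average over all training steps."""
        if self._step == 0:
            return
        for feat, per_class in self.weights.items():
            for cls, weight in per_class.items():
                key = (feat, cls)
                total = self._totals[key] + (self._step - self._stamps[key]) * weight
                per_class[cls] = total / self._step


# Toy data: (features, tag). The feature names are invented for this example.
train = [
    (["suffix=as", "prev=<s>"], "noun"),
    (["suffix=ēt", "prev=noun"], "verb"),
    (["suffix=u", "prev=verb"], "noun"),
    (["suffix=dz", "prev=noun"], "verb"),
]

model = AveragedPerceptron({"noun", "verb"})
for _ in range(5):
    for features, tag in train:
        model.update(tag, model.predict(features), features)
model.average()

# A feature combination that never occurred together during training.
print(model.predict(["suffix=as", "prev=verb"]))  # prints: noun
```

Prediction is just a sum of weights for the active features, and training only adds or subtracts one when a guess is wrong, which is why this kind of model is cheap to run.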

The source code is available on GitHub: here (Apache 2.0) and here (GPLv3).

Note that it's an old version of the code. While working at Tilde (described below), I improved it, but that source code is proprietary.

Tilde Tagger

Tilde is a company that develops language solutions for the Baltic languages. It's also where I worked for more than 3.5 years.

They had a PoS tagger based on a maximum entropy model, published here in 2011. Nowadays it's old, slow and the least accurate of all the taggers listed here.

I developed my perceptron tagger in my spare time and made it open source. Later I worked on an improved version for Tilde, which unfortunately is proprietary software. The proprietary version is built on their proprietary annotated corpus and comes with a nice command-line interface. Other than that, it has the same code and features as the open source version.

You can contact them to get access to the tool or use it via a web service.