This is an overview of available PoS taggers for Latvian.
There's also a step by step tutorial on how to use one to tag an untokenized piece of text.
Table of Contents
LU MII Tagger
The Institute of Mathematics and Computer Science at the University of Latvia (LU MII in Latvian) have a part-of-speech tagger based on the Stanford POS Tagger.
The tagger is described in this paper and they claim an accuracy of 97.9% for PoS tags and an accuracy of 93.6% for full morphological tags (case, gender, number, tense, etc.).
Installation
It's written in Java and you can just download and run a .jar
file.
Here's how to install Java on Ubuntu 16.04:
sudo add-apt-repository -y ppa:webupd8team/java
sudo apt-get update
echo "oracle-java8-installer shared/accepted-oracle-license-v1-1 select true" | sudo debconf-set-selections
sudo apt-get install -y oracle-java8-installer
Download the tagger:
wget "https://oss.sonatype.org/service/local/artifact/maven/redirect?r=releases&g=lv.ailab.morphology&a=tagger&v=1.0.0&e=jar&c=jar-with-dependencies" \
-O tagger-1.0.0-jar-with-dependencies.jar
Use
Run the following command:
$ echo "KNAB lūdz apsūdzēt policijas amatpersonu par €300 kukuļņemšanu." | java -jar tagger-1.0.0-jar-with-dependencies.jar
1 KNAB knab y y Vārdšķira=Saīsinājums|Lielo_burtu_lietojums=Rakstīts_ar_lielajiem_burtiem|LETA_lemma=knab|Avota_pamatforma=knab|Pamatforma=knab
2 lūdz lūgt v__m___2s__ vmnm0t12san Laiks=Nepiemīt|Skaitlis=Vienskaitlis|Persona=2|Darbības_vārda_tips=Patstāvīgs_darbības_vārds|Atgriezeniskums=Nē|Avota_pamatforma=lūgt|Pamatforma=lūgt|Noliegums=Nē|Transitivitāte=Pārejošs|Vārdšķira=Darbības_vārds|LETA_lemma=lūgt|Izteiksme=Pavēles|Kārta=Darāmā
3 apsūdzēt apsūdzēt v__n___00__ vmnn0t3000n Laiks=Nepiemīt|Skaitlis=Nepiemīt|Persona=Nepiemīt|Darbības_vārda_tips=Patstāvīgs_darbības_vārds|Atgriezeniskums=Nē|Avota_pamatforma=apsūdzēt|Pamatforma=apsūdzēt|Noliegums=Nē|Transitivitāte=Pārejošs|Vārdšķira=Darbības_vārds|LETA_lemma=apsūdzēt|Izteiksme=Nenoteiksme|Kārta=Nepiemīt
4 policijas policija n_fsg_ ncfsg4 Skaitlis=Vienskaitlis|Avota_pamatforma=policija|Lietvārda_tips=Sugas_vārds|Pamatforma=policija|Vārdšķira=Lietvārds|Locījums=Ģenitīvs|Dzimte=Sieviešu|LETA_lemma=policija
5 amatpersonu amatpersona n_fsa_ ncfsa4 Skaitlis=Vienskaitlis|Avota_pamatforma=amatpersona|Lietvārda_tips=Sugas_vārds|Pamatforma=amatpersona|Vārdšķira=Lietvārds|Locījums=Akuzatīvs|Dzimte=Sieviešu|LETA_lemma=amatpersona
6 par par s____ sppdn Skaitlis=Daudzskaitlis|Vietas_apstākļa_nozīme=Nē|Avota_pamatforma=par|Pamatforma=par|Vārdšķira=Prievārds|LETA_lemma=par|Novietojums=Pirms|Rekcija=Datīvs
7 € € xx xx Vārdšķira=Reziduālis|LETA_lemma=_rare_|Avota_pamatforma=€|Pamatforma=€
8 300 300 xn xn Reziduāļa_tips=Skaitlis_cipariem|Vārdšķira=Reziduālis|LETA_lemma=300|Pamatforma=300
9 kukuļņemšanu kukuļņemšana n_fsa_ ncfsa4 Skaitlis=Vienskaitlis|Avota_pamatforma=kukuļņemšana|Lietvārda_tips=Sugas_vārds|Pamatforma=kukuļņemšana|Vārdšķira=Lietvārds|Locījums=Akuzatīvs|Dzimte=Sieviešu|LETA_lemma=_rare_
10 . . zs zs Pieturzīmes_tips=Punkts|Vārdšķira=Pieturzīme|LETA_lemma=.|Avota_pamatforma=.|Pamatforma=.
As you can see, it takes plain text in standard input and performs tokenization, morphological analysis and disambiguation for you.
The output is the token number, original token, lemma, reduced morphological tag, full morphological tag, followed by various attributes.
You can tag a plain text file like this:
java -jar tagger-1.0.0-jar-with-dependencies.jar < input.txt > output.txt
Tag Set
The tag set is based on MULTEXT-East and you can see a (probably outdated) table of all possible tags here. The most up-to-date version of the tag set is available in an XML format on GitHub.
Here is a short description of tags from the example above.
lūdz lūgt v__m___2s__
:lūgt
- lemma,v
- verb,m
- imperative mood (incorrect),2
- 2nd person (incorrect),s
- singularpolicijas policija n_fsg_
:policija
- lemma,n
- noun,f
- feminine,s
- singular,g
- genitive case
TensorFlow Tagger
Pēteris Paikens (it's not me, we have the same first name) who also developed the LU MII tagger has done some research on using deep neural networks for PoS/morphological tagging for Latvian.
His research was published at Baltic HLT 2016 (I built the first version of the website, by the way) and it achieved the best results so far. It's built with TensorFlow. What's interesting is that it doesn't require a morphological analyzer at all.
The source code is available on GitHub and you can read the paper in the conference proceedings (page 160, near the end).
While it's got the highest accuracy, it's not the easiest tool to use (you have to set up TensorFlow to run it) and it is slower than the others tools.
- Paikens-2012 is the LU MII tagger described above
- Ņikiforovs-2015 is my perceptron tagger described below
Perceptron Tagger
I developed a part-of-speech tagger for Latvian based on the multiclass averaged perceptron machine learning algorithm in 2013/2014.
I did my Bachelor's thesis (in Latvian) on it in 2015.
It's got a slightly better accuracy than the LU MII tagger and it's much, much faster because it's so simple. By fast I mean you can tag 400,000 tokens per second on an i7.
The source code is available on GitHub: here (Apache 2.0) and here (GPLv3).
Note that it's an old version of the code. While working at Tilde (described below), I improved it but the source code is proprietary.
Tilde Tagger
Tilde is a company that develops language solutions for the Baltic languages. It's also where I worked for more than 3.5 years.
They had a PoS tagger based on the maximum entropy, published here in 2011. Nowadays it's old, slow and the least accurate of all taggers.
I developed my perceptron tagger in my spare time and made it open source. Later I worked on an improved version for Tilde which unfortunately is proprietary software. The proprietary version is basically built on their proprietary annotated corpus and there's a nice command line interface for it. Other than that it's the same code and features as the open source version.
You can contact them to get access to the tool or use it via a web service.