TnT POS tagging with the CLTK

25 Oct 2014

Important! This post contains incorrect accuracy scores. See http://cltk.org/blog/2015/08/02/corrected-stats-pos-tagger-accuracy.html for better information.

My previous post built on the CLTK’s n-gram taggers and used them in conjunction with one another. Now I turn to a TnT tagger I have created, whose results are even more promising than the backoff tagger’s. My detailed notes are available in /notebooks.
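For readers who want to try this at home, here is a minimal sketch of training NLTK’s TnT tagger. The tiny tagged sentences and the N/V/A tagset are toy illustrations, not the CLTK’s actual training corpus:

```python
# Train NLTK's TnT tagger on a toy tagged corpus (illustrative data,
# not the CLTK corpora used for the figures in this post).
from nltk.tag import tnt

train_sents = [
    [("Gallia", "N"), ("est", "V"), ("omnis", "A"), ("divisa", "A")],
    [("Gallia", "N"), ("est", "V"), ("divisa", "A")],
]

tagger = tnt.TnT()
tagger.train(train_sents)

# Tag a sentence of words seen in training.
print(tagger.tag(["Gallia", "est", "divisa"]))
```

With real data the training input is simply a list of tagged sentences in the same `(word, tag)` format, so swapping in a CLTK corpus reader is straightforward.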

Before diving into what a TnT tagger does, here are its results, along with those from my previous experiments.

Greek and Latin tagger accuracy
Tagger                      Greek               Latin
1-gram > 2-gram > 3-gram    0.972572292486997   0.9796586568315676
UnigramTagger()             0.9196123340065213  0.8873793350017877
BigramTagger()              0.8125528866223641  0.7211862333703404
TrigramTagger()             0.8101779247007322  0.8162128596428504
TnTTagger()                 ?                   0.9871855183184991
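For comparison, the backoff chain in the table’s first row can be sketched like this; again the training data is a toy stand-in for the CLTK corpora:

```python
# Sketch of the 1-gram > 2-gram > 3-gram backoff chain from the
# previous post: the trigram tagger falls back to the bigram tagger,
# which falls back to the unigram tagger. Toy data, not CLTK's.
from nltk.tag import UnigramTagger, BigramTagger, TrigramTagger

train_sents = [
    [("arma", "N"), ("virumque", "N"), ("cano", "V")],
    [("arma", "N"), ("cano", "V")],
]

unigram = UnigramTagger(train_sents)
bigram = BigramTagger(train_sents, backoff=unigram)
trigram = TrigramTagger(train_sents, backoff=bigram)

print(trigram.tag(["arma", "cano"]))
```

Each tagger in the chain tries only its own n-gram context and defers to its `backoff` when it has never seen that context, which is exactly the limitation the TnT tagger addresses.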

Judging by the Latin results, this is very promising. The tagger works for Greek too, though evaluate() did not finish after almost 10 hours on my personal computer (a fairly new MacBook), so I will need to borrow time on a more powerful machine.

“TnT” is short for “Trigrams’n’Tags”. Its strength over n-gram taggers used in a backoff chain is that the TnT tagger weighs a tag’s unigram, bigram, and trigram probabilities together, rather than falling back from one model to the next. The results of this tagger speak for themselves, though it will fail on unknown words. Continuing this work, I plan to create affix taggers for each language, which will read the tags of morphological beginnings and endings, and then pass each one to the TnT tagger as a special tagger for unknown words, e.g., tnt.TnT(unk=suffix_tagger, Trained=True). In doing so, I will be following the lead of TnT’s creator, whose initial research is available as a .pdf here. For more on the NLTK’s tagger, see pp. 100–102 of Perkins’s Python Text Processing and of course the NLTK’s API docs.
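The planned unknown-word setup can be sketched as follows. The suffix length, tagset, and training sentences here are assumptions for illustration; the real affix taggers would be trained per language on the CLTK corpora:

```python
# Hedged sketch of the planned setup: an AffixTagger trained on word
# endings handles words TnT has never seen. Toy Latin data.
from nltk.tag import AffixTagger, tnt

train_sents = [
    [("amamus", "V"), ("puellam", "N")],
    [("amatis", "V"), ("puellas", "N")],
]

# Tag by the last three letters of each word (affix_length=-3),
# e.g. "-mus" -> V, "-lam" -> N in this toy data.
suffix_tagger = AffixTagger(train_sents, affix_length=-3)

# Trained=True tells TnT that the unknown-word tagger is already trained.
tnt_tagger = tnt.TnT(unk=suffix_tagger, Trained=True)
tnt_tagger.train(train_sents)

# "amabamus" never appears in training, so TnT defers to the suffix tagger.
print(tnt_tagger.tag(["amabamus"]))
```

Whether a three-letter suffix is the right window for Greek and Latin morphology is an open question; NLTK’s AffixTagger can also read prefixes (positive affix_length), which may matter for augmented verb forms.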
