Important! This post contains incorrect accuracy scores. See http://cltk.org/blog/2015/08/02/corrected-stats-pos-tagger-accuracy.html for better information.
My previous post built on the CLTK’s n–gram taggers, using them in conjunction with one another as a backoff chain. Now I am considering a TnT tagger I have created, whose results are even more promising than the backoff tagger’s. My detailed notes are available in /notebooks.
Before diving into what a TnT tagger does, here are its results, along with those from my previous experiments.
| Tagger | Latin accuracy |
| --- | --- |
| 1–gram > 2–gram > 3–gram backoff | 0.972572292486997 |
| TnT | 0.9796586568315676 |
Judging by the Latin example, these results are very promising. The tagger works for Greek too, though evaluate() does not finish after nearly 10 hours on my personal computer (a fairly new MacBook), so I will need to borrow time on a more powerful machine.
“TnT” is short for “Trigrams’n’Tags”. Its strength over n–gram taggers chained in a backoff is that TnT weighs the unigram, bigram, and trigram probabilities of a tag together, rather than falling back to each model one at a time. The results of this tagger speak for themselves, though it fails on unknown words. Continuing this work, I plan to create affix taggers for each language, which will read the tags of morphological beginnings and endings, and then pass one in as a special tagger for unknown words with, e.g.,
tnt.TnT(unk=suffix_tagger, Trained=True). In doing so, I will be following the lead of TnT’s creator, whose initial research is available as a .pdf here. For more on the NLTK’s tagger, see also pp. 100–102 of Perkins’s Python Text Processing and of course the NLTK’s API docs.
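The plan can be sketched as follows with NLTK’s AffixTagger, which guesses a tag from a word’s ending. The toy sentences and the two-character suffix length are illustrative assumptions, not the settings I will use for the real Latin and Greek corpora; Trained=True tells TnT that the unknown-word tagger is already trained.

```python
# Sketch: a suffix-based AffixTagger as TnT's unknown-word handler.
# Toy data and affix_length=-2 are illustrative assumptions.
from nltk.tag import AffixTagger, tnt

train_sents = [
    [("puella", "N"), ("cantat", "V")],
    [("nauta", "N"), ("laborat", "V")],
    [("poeta", "N"), ("errat", "V")],
]

# Learn tags from the last two characters of each training word.
suffix_tagger = AffixTagger(train_sents, affix_length=-2)

# Trained=True: the unk tagger is already trained, so TnT will not
# attempt to train it again.
tagger = tnt.TnT(unk=suffix_tagger, Trained=True)
tagger.train(train_sents)

# "agricola" and "amat" were never seen in training, so TnT defers
# to the suffix tagger for both.
tags = tagger.tag(["agricola", "amat"])
```

Because “agricola” shares its ending with “puella” and “amat” shares its ending with the trained verbs, the suffix tagger can supply plausible tags where plain TnT would give up.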