Affix POS tagging with the CLTK

27 Oct 2014

Important! This post contains incorrect accuracy scores. See http://cltk.org/blog/2015/08/02/corrected-stats-pos-tagger-accuracy.html for better information.

Affix tagging looks at the beginning or end of a word and chooses a POS tag based on that prefix or suffix. I’ve worked through a number of prefix and suffix lengths for Greek and Latin. Combined with ongoing results from my previous post, the affix tagger results are:
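To make the idea concrete, here is a minimal pure-Python sketch of an affix tagger. It is not the CLTK or NLTK implementation; the function names, the toy Latin training words, and the tags `V` and `N` are all made up for illustration. It borrows one convention from NLTK's `AffixTagger`: a positive `affix_length` means a prefix and a negative one means a suffix.

```python
from collections import Counter, defaultdict


def get_affix(word, affix_length, min_stem_length):
    """Return the prefix (positive length) or suffix (negative length).

    Returns None when the word is too short to leave a stem of
    min_stem_length characters.
    """
    if len(word) < abs(affix_length) + min_stem_length:
        return None
    return word[:affix_length] if affix_length > 0 else word[affix_length:]


def train_affix_tagger(tagged_sents, affix_length=-3, min_stem_length=2):
    """Map each affix to the tag it most often carries in the training data."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            affix = get_affix(word, affix_length, min_stem_length)
            if affix is not None:
                counts[affix][tag] += 1
    return {affix: tags.most_common(1)[0][0] for affix, tags in counts.items()}


def tag(words, model, affix_length=-3, min_stem_length=2):
    """Tag each word by its affix; unknown or too-short words get None."""
    return [
        (word, model.get(get_affix(word, affix_length, min_stem_length)))
        for word in words
    ]


# Hypothetical toy training data: V = verb, N = noun.
train = [
    [("amabant", "V"), ("puellarum", "N")],
    [("monebant", "V"), ("rosarum", "N")],
]

model = train_affix_tagger(train, affix_length=-3)
result = tag(["laudabant", "dominarum", "et"], model, affix_length=-3)
print(result)
```

Words the model has never seen still get a tag as long as their ending was seen in training, which is the whole appeal of the approach. Words too short to split, such as "et", are left untagged.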

Greek and Latin tagger accuracy

Tagger                      Greek                  Latin
1–gram > 2–gram > 3–gram    0.972572292486997      0.9796586568315676
UnigramTagger()             0.9196123340065213     0.8873793350017877
BigramTagger()              0.8125528866223641     0.7211862333703404
TrigramTagger()             0.8101779247007322     0.8162128596428504
TnTTagger()                 ?                      0.9871855183184991
2-char prefix               0.1399196687464037     0.11491635775172647
3-char prefix               0.166630938815114      0.15512861524565794
4-char prefix               0.16541243103584444    0.16120655589635513
2-char suffix               0.20355849401464465    0.22117682479348175
3-char suffix               0.2462203693883768     0.2615772538245865
4-char suffix               0.23366014915437816    0.2749374329638899
5-char suffix               0.18811560028431848    0.2376041999887097
6-char suffix               0.1341543217537486     0.1721957736672751

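By themselves, the affix taggers are clearly much weaker than the others: none scores above about 0.27 in the table, while every n–gram tagger scores above 0.72. Suffixes do better than prefixes, with the best results at 3 characters for Greek and 4 characters for Latin. Still, affix taggers should be useful for highly inflected languages, where word endings carry much of the grammatical information. I suspect their value will become apparent when they are combined in a backoff chain with n–gram or other taggers.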
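As a rough illustration of what such a backoff chain might look like, the sketch below tries a whole-word (unigram) lookup first and falls back to a 3-character suffix lookup only when the word itself is unknown. It is self-contained pure Python, not the CLTK or NLTK code, and the helper names, the toy training words, and the tags `V`, `N`, and `C` are assumptions made up for the example.

```python
from collections import Counter, defaultdict


def most_frequent_tags(pairs):
    """Map each key to the tag it is most often seen with."""
    counts = defaultdict(Counter)
    for key, tag in pairs:
        counts[key][tag] += 1
    return {key: tags.most_common(1)[0][0] for key, tags in counts.items()}


def make_unigram_tagger(tagged_words):
    """Tagger that looks up the whole word; returns None if unseen."""
    model = most_frequent_tags(tagged_words)
    return lambda word: model.get(word)


def make_suffix_tagger(tagged_words, length=3, min_stem_length=2):
    """Tagger that looks up the last `length` characters of the word."""
    min_len = length + min_stem_length
    model = most_frequent_tags(
        (word[-length:], tag) for word, tag in tagged_words if len(word) >= min_len
    )

    def tagger(word):
        if len(word) < min_len:
            return None  # too short to split into stem and suffix
        return model.get(word[-length:])

    return tagger


def backoff_tag(word, taggers, default=None):
    """Return the first non-None tag produced by the chain of taggers."""
    for tagger in taggers:
        tag = tagger(word)
        if tag is not None:
            return tag
    return default


# Hypothetical toy training data: V = verb, N = noun, C = conjunction.
train = [
    ("amabant", "V"), ("monebant", "V"),
    ("puellarum", "N"), ("rosarum", "N"),
    ("et", "C"),
]

chain = [make_unigram_tagger(train), make_suffix_tagger(train, length=3)]

for word in ["et", "laudabant", "Roma"]:
    print(word, backoff_tag(word, chain, default="UNK"))
```

Here "et" is resolved by the unigram lookup, "laudabant" falls through to the suffix tagger, and "Roma" is matched by neither and receives the default. Which order of taggers works best on real Greek and Latin data is exactly the open question, so the ordering above is only one possibility.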

But before running extensive tests of all of these taggers in various backoff orders, I will be writing a proper noun dictionary for Greek and Latin, drawn from the TLG and PHI5 corpora. This will provide good “cleanup” at the very end of a backoff chain, as I have noticed that even my best taggers are missing many, if not most, proper nouns.
