Affix POS tagging with the CLTK
Important! This post contains incorrect accuracy scores. See http://cltk.org/blog/2015/08/02/corrected-stats-pos-tagger-accuracy.html for better information.
An affix tagger chooses a POS tag by looking only at the first or last few characters of a word. I’ve tested a range of prefix and suffix lengths for Greek and Latin. Alongside the running results from my previous post, the affix tagger accuracies are:
Tagger | Greek | Latin |
---|---|---|
1–gram > 2–gram > 3–gram | 0.972572292486997 | 0.9796586568315676 |
UnigramTagger() | 0.9196123340065213 | 0.8873793350017877 |
BigramTagger() | 0.8125528866223641 | 0.7211862333703404 |
TrigramTagger() | 0.8101779247007322 | 0.8162128596428504 |
TnTTagger() | ? | 0.9871855183184991 |
2-char prefix | 0.1399196687464037 | 0.11491635775172647 |
3-char prefix | 0.166630938815114 | 0.15512861524565794 |
4-char prefix | 0.16541243103584444 | 0.16120655589635513 |
2-char suffix | 0.20355849401464465 | 0.22117682479348175 |
3-char suffix | 0.2462203693883768 | 0.2615772538245865 |
4-char suffix | 0.23366014915437816 | 0.2749374329638899 |
5-char suffix | 0.18811560028431848 | 0.2376041999887097 |
6-char suffix | 0.1341543217537486 | 0.1721957736672751 |
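
For readers new to the technique, each prefix and suffix row above corresponds to a tagger of roughly the following shape. This is a minimal pure-Python sketch, not the code behind the scores: NLTK ships an `AffixTagger` class for this purpose. The function names, the `min_stem_length` cutoff, and the toy tags `V` and `N` are my own assumptions for illustration. A negative `affix_length` means a suffix and a positive one means a prefix.

```python
from collections import Counter, defaultdict


def get_affix(word, affix_length, min_stem_length):
    """Return the prefix (positive length) or suffix (negative length),
    or None if the word is too short to leave a stem."""
    if len(word) < abs(affix_length) + min_stem_length:
        return None
    return word[:affix_length] if affix_length > 0 else word[affix_length:]


def train_affix_tagger(tagged_words, affix_length=-3, min_stem_length=2):
    """Map each affix to the tag it most often carries in the training data."""
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        affix = get_affix(word, affix_length, min_stem_length)
        if affix is not None:
            counts[affix][tag] += 1
    return {affix: tags.most_common(1)[0][0] for affix, tags in counts.items()}


def tag_word(model, word, affix_length=-3, min_stem_length=2):
    """Look up the word's affix; return None when the affix is unseen."""
    affix = get_affix(word, affix_length, min_stem_length)
    return model.get(affix)


# Toy Latin training data with hypothetical tags (V = verb, N = noun).
training = [
    ("amabat", "V"), ("monebat", "V"), ("regebat", "V"),
    ("puellarum", "N"), ("rosarum", "N"), ("dominorum", "N"),
]
model = train_affix_tagger(training, affix_length=-3)

print(tag_word(model, "laudabat"))   # suffix "bat" was seen on verbs
print(tag_word(model, "stellarum"))  # suffix "rum" was seen on nouns
print(tag_word(model, "et"))         # too short for a 3-char suffix
```

A tagger like this returns nothing for any word whose affix it never saw in training, which is one reason its standalone accuracy is low.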
Clearly these taggers, on their own, are far less accurate than the others. Still, they should be useful for highly inflected languages such as Greek and Latin, where word endings carry much of the grammatical information. I suspect their value will become apparent when they are combined in a backoff chain with n–gram or other taggers.
But before running extensive tests of all of these taggers in various backoff orders, I will write a proper noun dictionary for Greek and Latin, drawn from the TLG and PHI5 corpora. This should provide good “cleanup” at the very end of a backoff chain, as I have noticed that even my best taggers miss many, if not most, proper nouns.
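
To make the idea of a backoff chain with a proper noun stage at the end concrete, here is a rough sketch in plain Python. It is only an illustration under simplifying assumptions: each stage sees a single word with no context (unlike the n–gram taggers), and the lookup tables, the helper names, and the tags `C`, `N`, `V` and `NP` are hypothetical. The two names in `proper_nouns` are placeholders, not entries drawn from the TLG or PHI5.

```python
def make_lookup_tagger(table):
    """Stage that tags a word only if the whole word is in the table."""
    return lambda word: table.get(word)


def make_suffix_tagger(table, length):
    """Stage that tags a word by its last `length` characters."""
    def tagger(word):
        if len(word) <= length:
            return None
        return table.get(word[-length:])
    return tagger


def backoff_tag(word, taggers):
    """Try each stage in order; the first non-None answer wins."""
    for tagger in taggers:
        tag = tagger(word)
        if tag is not None:
            return tag
    return None


# Hypothetical toy tables standing in for trained models.
unigram = make_lookup_tagger({"et": "C", "puella": "N"})
suffix3 = make_suffix_tagger({"bat": "V", "rum": "N"}, 3)
proper_nouns = make_lookup_tagger({"Caesar": "NP", "Roma": "NP"})

# The proper noun dictionary sits last, as cleanup for what the others miss.
chain = [unigram, suffix3, proper_nouns]

for word in ["et", "cantabat", "Caesar", "xyz"]:
    print(word, backoff_tag(word, chain))
```

Because the first stage to answer wins, the order of the chain matters, which is why the different backoff orders need to be tested against each other.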