Notebooks
My notebooks and lecture materials are hosted at github.com/kylepjohnson/ipython.

- An example use of `Counter().most_common()` to find the most frequently used words (non-stemmed) in the TLG (a minimal sketch of the technique follows this list).
- 10,000 most common Latin words: an example use of `Counter().most_common()` to find the most frequently used words (non-stemmed) in the PHI5.
- Cross-validates the accuracy of the CLTK's taggers, giving the mean and standard deviation of accuracy for each. This is a good check of the taggers' accuracy and demonstrates that the models are not overfit to the data (a sketch of the cross-validation loop follows this list).
- The lexical diversity of all authors in the TLG.
- The lexical diversity of all authors in the PHI5.
- Generates a 131,044-line file of tf-idf similarity scores between all PHI5 Latin authors. Explanatory blog post here (a minimal tf-idf sketch follows this list).
- Generates a 3,157,729-line file of tf-idf similarity scores between all TLG authors. Explanatory blog post here.
- A high-level look at average words per sentence across all of Ancient Greek literature and within several genres (e.g., history, romance, philosophy, epic, tragedy, comedy). This may not seem like much, though to my knowledge it is the first survey of its kind (a sketch of the measure follows this list).
- This notebook follows from "Greek authors' average words per sentence", looking instead at the Latin-language PHI5 corpus. It offers basic table views sorted by words per sentence, total sentences, and total words. I also include a view limited to Roman historians.
- A quick overview of part-of-speech (POS) tagging using the NLTK's unigram tagger combined with the CLTK's POS training set. Accuracies are evaluated, too.
- A quick overview of part-of-speech (POS) tagging using the NLTK's bigram tagger combined with the CLTK's POS training set. Accuracies are evaluated, too.
- A quick overview of part-of-speech (POS) tagging using the NLTK's trigram tagger combined with the CLTK's POS training set. Accuracies are evaluated, too.
- Demonstrates the excellent results (97% accuracy) of combining unigram, bigram, and trigram taggers in a backoff chain for Ancient Greek (a sketch of such a chain follows this list).
- Demonstrates the excellent results (98% accuracy) of combining unigram, bigram, and trigram taggers in a backoff chain for Classical Latin.
- Demonstrates use of a TnT tagger for Greek. The `evaluate()` function does not finish even after many hours of computing (on my machine, at least), though the tagger itself does work. See the example of it in action, and the toy sketch after this list.
- Demonstrates the 99% accuracy of a TnT tagger for Classical Latin.
- Experiments in prefix and suffix tagging, with accuracies (a sketch of affix tagging follows this list).
- Experiments in prefix tagging, with accuracies.
- Experiments in suffix tagging, with accuracies.
- How to make a human-editable, POS-tagged file for a TLG author's work.
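
The sketches below illustrate several of the techniques mentioned in the list above. Each is a minimal reconstruction with placeholder file names, texts, and tags, not the notebooks' actual code. First, a word count with `Counter().most_common()`, as in the TLG and PHI5 most-common-words notebooks:

```python
# Minimal sketch: most frequent words in a corpus via collections.Counter.
# The file path and whitespace tokenization are placeholders.
from collections import Counter

with open('tlg_plaintext.txt', encoding='utf-8') as f:
    tokens = f.read().lower().split()

counts = Counter(tokens)
print(counts.most_common(10))  # the ten most frequent (non-stemmed) words
```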
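
The tagger cross-validation notebooks report the mean and standard deviation of accuracy across folds. A sketch of that loop, assuming `tagged_sents` is a list of POS-tagged sentences such as the CLTK training set provides:

```python
# Minimal sketch: k-fold cross-validation of an NLTK tagger, returning the
# mean and standard deviation of accuracy across folds.
import statistics
from nltk.tag import UnigramTagger

def cross_validate(tagged_sents, folds=10):
    fold_size = len(tagged_sents) // folds
    accuracies = []
    for i in range(folds):
        test = tagged_sents[i * fold_size:(i + 1) * fold_size]
        train = tagged_sents[:i * fold_size] + tagged_sents[(i + 1) * fold_size:]
        tagger = UnigramTagger(train)
        accuracies.append(tagger.evaluate(test))  # .accuracy() in newer NLTK
    return statistics.mean(accuracies), statistics.stdev(accuracies)
```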
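
Lexical diversity, as measured in the TLG and PHI5 lexical-diversity notebooks, is the ratio of unique tokens to total tokens:

```python
# Minimal sketch: lexical diversity = unique tokens / total tokens.
def lexical_diversity(tokens):
    return len(set(tokens)) / len(tokens)

sample = 'arma virumque cano troiae qui primus ab oris arma'.split()
print(lexical_diversity(sample))  # 8 unique tokens / 9 total = 0.89
```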
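
The tf-idf notebooks score every author against every other. A rough equivalent with scikit-learn (the author names and texts here are placeholders) vectorizes one document per author and takes pairwise cosine similarity:

```python
# Minimal sketch: pairwise tf-idf cosine similarity between authors.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

authors = {  # placeholder author texts
    'Vergil': 'arma virumque cano troiae qui primus ab oris',
    'Ovid': 'in nova fert animus mutatas dicere formas corpora',
    'Caesar': 'gallia est omnis divisa in partes tres',
}

matrix = TfidfVectorizer().fit_transform(authors.values())
scores = cosine_similarity(matrix)  # square author-by-author matrix

names = list(authors)
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        print(names[i], names[j], round(scores[i, j], 3))
```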
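
The words-per-sentence surveys come down to dividing a token count by a sentence count. The sketch below uses a naive split on sentence-final punctuation; the notebooks themselves rely on the CLTK's language-specific sentence tokenizers:

```python
# Minimal sketch: average words per sentence with a naive sentence split.
import re

def words_per_sentence(text):
    sentences = [s for s in re.split(r'[.;?!]', text) if s.strip()]
    return len(text.split()) / len(sentences)

print(words_per_sentence('Gallia est omnis divisa in partes tres. '
                         'Quarum unam incolunt Belgae.'))  # 5.5
```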
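
The n-gram tagging notebooks (unigram, bigram, trigram, and the backoff chains) follow the standard NLTK pattern of chaining taggers with the `backoff` parameter. A sketch, again assuming `train_sents` and `test_sents` are lists of POS-tagged sentences:

```python
# Minimal sketch: unigram -> bigram -> trigram backoff chain with NLTK.
from nltk.tag import UnigramTagger, BigramTagger, TrigramTagger

def build_backoff_tagger(train_sents):
    t1 = UnigramTagger(train_sents)
    t2 = BigramTagger(train_sents, backoff=t1)
    t3 = TrigramTagger(train_sents, backoff=t2)
    return t3

# Usage (train_sents and test_sents are assumed to exist):
# tagger = build_backoff_tagger(train_sents)
# print(tagger.evaluate(test_sents))  # accuracy on held-out sentences
```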
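
The TnT notebooks use NLTK's implementation of the TnT tagger. A toy-sized sketch (the training sentences and tags below are placeholders, not the CLTK training set):

```python
# Minimal sketch: training and applying NLTK's TnT tagger on toy data.
from nltk.tag import tnt

train_sents = [
    [('Gallia', 'N'), ('est', 'V'), ('omnis', 'A')],  # placeholder data
    [('arma', 'N'), ('cano', 'V')],
]

tnt_tagger = tnt.TnT()
tnt_tagger.train(train_sents)
print(tnt_tagger.tag(['Gallia', 'cano']))
```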
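
Finally, the prefix and suffix experiments can be approximated with NLTK's `AffixTagger`, where a negative `affix_length` trains on word endings and a positive one on word beginnings (an assumed equivalent, not necessarily the notebooks' exact approach):

```python
# Minimal sketch: suffix and prefix taggers with NLTK's AffixTagger.
from nltk.tag import AffixTagger

train_sents = [
    [('amabat', 'V'), ('puella', 'N'), ('bonam', 'A')],  # placeholder data
    [('amabant', 'V'), ('puellae', 'N')],
]

suffix_tagger = AffixTagger(train_sents, affix_length=-3)  # last three letters
prefix_tagger = AffixTagger(train_sents, affix_length=3)   # first three letters
print(suffix_tagger.tag(['amabit']), prefix_tagger.tag(['amabit']))
```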