Notebooks
My notebooks and lecture materials are hosted at github.com/kylepjohnson/ipython.

- An example use of `Counter().most_common()` to find the most frequently used words (non-stemmed) in the TLG (a minimal sketch of the technique follows this list).
- 10,000 most common Latin words: an example use of `Counter().most_common()` to find the most frequently used words (non-stemmed) in the PHI5.
- Cross-validates the accuracy of the CLTK's taggers, giving the mean and standard deviation of accuracy for each. This is a good check of the taggers' accuracy and demonstrates that the models are not overfit to the data (a sketch of the cross-validation loop follows this list).
- The lexical diversity of all authors in the TLG.
- The lexical diversity of all authors in the PHI5.
- Generates a 131,044-line file of tf-idf similarity scores between all PHI5 Latin authors. Explanatory blog post here (a minimal tf-idf sketch follows this list).
- Generates a 3,157,729-line file of tf-idf similarity scores between all TLG authors. Explanatory blog post here.
- A high-level look at average words per sentence across all of Ancient Greek literature and within several genres (e.g., history, romance, philosophy, epic, tragedy, comedy). This may not seem like much, though to my knowledge it is the first survey of its kind (a sketch of the measure follows this list).
- This notebook follows from "Greek authors' average words per sentence", looking instead at the Latin-language PHI5 corpus. It offers basic table views sorted by words per sentence, total sentences, and total words. I also include a view limited to Roman historians.
- A quick overview of part-of-speech (POS) tagging using the NLTK's unigram tagger combined with the CLTK's POS training set. Accuracies are evaluated, too.
- A quick overview of part-of-speech (POS) tagging using the NLTK's bigram tagger combined with the CLTK's POS training set. Accuracies are evaluated, too.
- A quick overview of part-of-speech (POS) tagging using the NLTK's trigram tagger combined with the CLTK's POS training set. Accuracies are evaluated, too.
- Demonstrates the excellent results (97% accuracy) of combining unigram, bigram, and trigram taggers in a backoff chain for Ancient Greek (a sketch of such a chain follows this list).
- Demonstrates the excellent results (98% accuracy) of combining unigram, bigram, and trigram taggers in a backoff chain for Classical Latin.
- Demonstrates use of a TnT tagger for Greek. The `evaluate()` function does not finish even after many hours of computing (on my machine, at least), though the tagger itself does work. See the example of it in action, and the toy sketch after this list.
- Demonstrates the 99% accuracy of a TnT tagger for Classical Latin.
- Experiments in prefix and suffix tagging, with accuracies (a sketch of affix tagging follows this list).
- Experiments in prefix tagging, with accuracies.
- Experiments in suffix tagging, with accuracies.
- How to make a human-editable, POS-tagged file for a TLG author's work.
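
The sketches below illustrate several of the techniques mentioned in the list above. Each is a minimal reconstruction with placeholder file names, texts, and tags, not the notebooks' actual code. First, a word count with `Counter().most_common()`, as in the TLG and PHI5 most-common-words notebooks:

```python
# Minimal sketch: most frequent words in a corpus via collections.Counter.
# The file path and whitespace tokenization are placeholders.
from collections import Counter

with open('tlg_plaintext.txt', encoding='utf-8') as f:
    tokens = f.read().lower().split()

counts = Counter(tokens)
print(counts.most_common(10))  # the ten most frequent (non-stemmed) words
```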
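
The tagger cross-validation notebooks report the mean and standard deviation of accuracy across folds. A sketch of that loop, assuming `tagged_sents` is a list of POS-tagged sentences such as the CLTK training set provides:

```python
# Minimal sketch: k-fold cross-validation of an NLTK tagger, returning the
# mean and standard deviation of accuracy across folds.
import statistics
from nltk.tag import UnigramTagger

def cross_validate(tagged_sents, folds=10):
    fold_size = len(tagged_sents) // folds
    accuracies = []
    for i in range(folds):
        test = tagged_sents[i * fold_size:(i + 1) * fold_size]
        train = tagged_sents[:i * fold_size] + tagged_sents[(i + 1) * fold_size:]
        tagger = UnigramTagger(train)
        accuracies.append(tagger.evaluate(test))  # .accuracy() in newer NLTK
    return statistics.mean(accuracies), statistics.stdev(accuracies)
```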
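
Lexical diversity, as measured in the TLG and PHI5 lexical-diversity notebooks, is the ratio of unique tokens to total tokens:

```python
# Minimal sketch: lexical diversity = unique tokens / total tokens.
def lexical_diversity(tokens):
    return len(set(tokens)) / len(tokens)

sample = 'arma virumque cano troiae qui primus ab oris arma'.split()
print(lexical_diversity(sample))  # 8 unique tokens / 9 total = 0.89
```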
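
The tf-idf notebooks score every author against every other. A rough equivalent with scikit-learn (the author names and texts here are placeholders) vectorizes one document per author and takes pairwise cosine similarity:

```python
# Minimal sketch: pairwise tf-idf cosine similarity between authors.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

authors = {  # placeholder author texts
    'Vergil': 'arma virumque cano troiae qui primus ab oris',
    'Ovid': 'in nova fert animus mutatas dicere formas corpora',
    'Caesar': 'gallia est omnis divisa in partes tres',
}

matrix = TfidfVectorizer().fit_transform(authors.values())
scores = cosine_similarity(matrix)  # square author-by-author matrix

names = list(authors)
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        print(names[i], names[j], round(scores[i, j], 3))
```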
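
The words-per-sentence surveys come down to dividing a token count by a sentence count. The sketch below uses a naive split on sentence-final punctuation; the notebooks themselves rely on the CLTK's language-specific sentence tokenizers:

```python
# Minimal sketch: average words per sentence with a naive sentence split.
import re

def words_per_sentence(text):
    sentences = [s for s in re.split(r'[.;?!]', text) if s.strip()]
    return len(text.split()) / len(sentences)

print(words_per_sentence('Gallia est omnis divisa in partes tres. '
                         'Quarum unam incolunt Belgae.'))  # 5.5
```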
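
The n-gram tagging notebooks (unigram, bigram, trigram, and the backoff chains) follow the standard NLTK pattern of chaining taggers with the `backoff` parameter. A sketch, again assuming `train_sents` and `test_sents` are lists of POS-tagged sentences:

```python
# Minimal sketch: unigram -> bigram -> trigram backoff chain with NLTK.
from nltk.tag import UnigramTagger, BigramTagger, TrigramTagger

def build_backoff_tagger(train_sents):
    t1 = UnigramTagger(train_sents)
    t2 = BigramTagger(train_sents, backoff=t1)
    t3 = TrigramTagger(train_sents, backoff=t2)
    return t3

# Usage (train_sents and test_sents are assumed to exist):
# tagger = build_backoff_tagger(train_sents)
# print(tagger.evaluate(test_sents))  # accuracy on held-out sentences
```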
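
The TnT notebooks use NLTK's implementation of the TnT tagger. A toy-sized sketch (the training sentences and tags below are placeholders, not the CLTK training set):

```python
# Minimal sketch: training and applying NLTK's TnT tagger on toy data.
from nltk.tag import tnt

train_sents = [
    [('Gallia', 'N'), ('est', 'V'), ('omnis', 'A')],  # placeholder data
    [('arma', 'N'), ('cano', 'V')],
]

tnt_tagger = tnt.TnT()
tnt_tagger.train(train_sents)
print(tnt_tagger.tag(['Gallia', 'cano']))
```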
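
Finally, the prefix and suffix experiments can be approximated with NLTK's `AffixTagger`, where a negative `affix_length` trains on word endings and a positive one on word beginnings (an assumed equivalent, not necessarily the notebooks' exact approach):

```python
# Minimal sketch: suffix and prefix taggers with NLTK's AffixTagger.
from nltk.tag import AffixTagger

train_sents = [
    [('amabat', 'V'), ('puella', 'N'), ('bonam', 'A')],  # placeholder data
    [('amabant', 'V'), ('puellae', 'N')],
]

suffix_tagger = AffixTagger(train_sents, affix_length=-3)  # last three letters
prefix_tagger = AffixTagger(train_sents, affix_length=3)   # first three letters
print(suffix_tagger.tag(['amabit']), prefix_tagger.tag(['amabit']))
```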