A Mind for Language

Nov 22, 2019

Fun with vocab-tools: comparing chapter vocab and glossing it

More fun with James Tauber's vocabulary-tools. Suppose you're reading through an NT book chapter by chapter and you wonder what new vocab you're likely to encounter in the next chapter that wasn't in the previous one. Vocabulary-tools can help you figure that out.

Vocabulary-tools doesn't include a glossing tool (as far as I know), but here is a simple one based on a gloss list from the Abbott-Smith NT Greek lexicon (which you can get here).

from greek_normalisation.utils import nfc

class Glosser:
    def __init__(self):
        self.data = dict()
        # Each line of gloss-dict.tab is "lemma<TAB>gloss";
        # strip the trailing newline so it doesn't end up in the gloss
        with open("gloss-dict.tab", 'r', encoding="UTF-8") as f:
            for line in f:
                parts = line.rstrip("\n").split("\t", maxsplit=1)
                if len(parts) > 1:
                    self.data[nfc(parts[0])] = parts[1]

    def get(self, l):
        normed = nfc(l)
        if normed in self.data:
            return self.data[normed]
        else:
            print(f"{normed} not found in Abbott-Smith")
            return ''
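To make the expected file format concrete, here is a minimal, self-contained sketch of the "lemma<TAB>gloss" layout the Glosser reads. The two entries are made up for illustration (the real file comes from Abbott-Smith), and a plain dict lookup stands in for the nfc normalization step:

```python
import os
import tempfile

# A hypothetical two-entry gloss file in the same "lemma<TAB>gloss"
# format that the Glosser above expects
sample = "λόγος\tword | speech\nγράφω\tto write\n"

path = os.path.join(tempfile.mkdtemp(), "gloss-dict.tab")
with open(path, "w", encoding="UTF-8") as f:
    f.write(sample)

# Parse it the same way Glosser.__init__ does
# (identity stands in for nfc here)
data = {}
with open(path, "r", encoding="UTF-8") as f:
    for line in f:
        parts = line.rstrip("\n").split("\t", maxsplit=1)
        if len(parts) > 1:
            data[parts[0]] = parts[1]

print(data["λόγος"])  # word | speech
```

The maxsplit=1 matters: it keeps any further tabs inside the gloss text intact rather than splitting them off.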

Now we can combine that with the following code and run it by typing py analyze_chapter.py <new-cpt-num>. It will print out a list of words that occur fewer than LIM times in the NT, the number of occurrences, and the gloss from Abbott-Smith (if found). I'm currently reading Acts; if you want a different book, then you'll need to replace BOOK_ABBV['ACT'] with the book code for the book you want to read. You can figure out this code from the vocabulary-tools module.

from collections import Counter
from gnt_data import get_tokens, get_tokens_by_chunk, TokenType, ChunkType
from abott_glosser import Glosser
import sys

new_cpt = int(sys.argv[1])
BOOK_ABBV = {"GLA": "69", "1JN" : "83", "ACT": "65"}

# Get all lemmas in GNT
gnt_lemmas = Counter(get_tokens(TokenType.lemma))

# format last chapter marker (chapter numbers are zero-padded to two digits)
last_cpt = f"{new_cpt - 1:02d}"

# Get lemmas for the current and previous chapters
LAST_CHAPTER = Counter(get_tokens(TokenType.lemma, ChunkType.chapter, BOOK_ABBV['ACT'] + last_cpt))
NEW_CHAPTER = Counter(get_tokens(TokenType.lemma, ChunkType.chapter, BOOK_ABBV['ACT'] + f"{new_cpt:02d}"))

# get GNT-wide freq, rather than freq in the current chapter
def getNTFreq(nt, tgt):
    out = {}
    for lemma in tgt:
        if lemma in nt:
            out[lemma] = nt[lemma]
    return out

# subtract vocab from the last chapter from the list
ACT_NT_FREQ = getNTFreq(gnt_lemmas, NEW_CHAPTER - LAST_CHAPTER)

# Filter lemmas to those that occur fewer than LIM times in the GNT as a whole
LIM = 10
freq = lambda x: x[1] < LIM
TGT = sorted(filter(freq, ACT_NT_FREQ.items()), key=lambda x: x[0])

print(len(TGT))

# setup glosser
glosser = Glosser()

# output results
for l in TGT:
    print(f"{l[0]}\t{l[1]}\t{glosser.get(l[0])}")
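The heavy lifting above is done by NEW_CHAPTER - LAST_CHAPTER: subtracting one Counter from another keeps only positive counts, so any lemma that occurs at least as often in the previous chapter drops out entirely. A small illustration with made-up lemma counts:

```python
from collections import Counter

# Hypothetical lemma counts for two consecutive chapters
new_chapter = Counter({"λόγος": 3, "γράφω": 1, "ἀκούω": 2})
last_chapter = Counter({"λόγος": 1, "ἀκούω": 2})

# Counter subtraction drops zero and negative counts:
# "ἀκούω" vanishes (2 - 2 = 0), "λόγος" keeps the surplus
diff = new_chapter - last_chapter

print(diff)  # Counter({'λόγος': 2, 'γράφω': 1})
```

Note this means "new vocab" here really means "lemmas more frequent in this chapter than the last", not strictly "never seen before".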

Running py analyze_chapter.py 11 for Acts 11 produced the following output.

21
Κλαύδιος        3       Claudius | C. Lysias
Κυρηναῖος       6       of Cyrene | a Cyrenæan
Κύπριος 3       of Cyprus | Cyprian
Κύπρος  5       Cyprus
Στέφανος        7       Stephen
Ταρσός  3       Tarsus
Φοινίκη not found in Abbott-Smith
Φοινίκη 3
Χριστιανός      3       a Christian
διασπείρω       3       to scatter abroad, disperse
εὐπορέομαι not found in Abbott-Smith
εὐπορέομαι      1
καθεξῆς 5       successively | in order | afterwards
προσμένω        7       to wait longer | continue | remain still | to remain with | to remain attached to | cleave unto | abide in
πρώτως  1       first
σημαίνω 6       to give a sign, signify, indicate
ἀναζητέω        3       to look for | seek carefully
ἀνασπάω 2       to draw up
Ἅγαβος not found in Abbott-Smith
Ἅγαβος  2
ἐκτίθημι        4       to set out, expose | to set forth, expound
Ἑλληνιστής      3       a Hellenist | Grecian Jew
ἡσυχάζω 5       to be still | to rest from labour | to live quietly | to be silent
ἴσος    8       equal | the same