A Mind for Language

Nov 14, 2019

Fun with James Tauber's vocabulary tools

James Tauber has written a set of vocabulary tools for the Greek New Testament (GNT).

I wanted to read Acts 10 and thought I'd see what the words occur there that occur less than 10 times in the GNT overall. The code in Listing 1 will get that list and print the word and its total GNT count to a text file.

Listing 1

from collections import Counter
from gnt_data import get_tokens, get_tokens_by_chunk, TokenType, ChunkType
import pprint

BOOK_ABBV = {"GLA": "69", "1JN" : "83", "ACT": "65"}

gnt_lemmas = Counter(get_tokens(TokenType.lemma))

ACT_10_lemmas = Counter(get_tokens(TokenType.lemma, ChunkType.chapter, BOOK_ABBV['ACT']+ "10"))

def getNTFreq(nt, tgt):
    out = {}
    for t in tgt.items():
        lemma = t[0]
        if lemma in nt:
            out[lemma] = nt[lemma]
    return out

ACT_NT_FREQ = getNTFreq(gnt_lemmas, ACT_10_lemmas)

freq = lambda x: int(x[1]) < 10

TGT = sorted(list(filter(freq,ACT_NT_FREQ.items())), key=lambda x: x[0])



with open("act_10.txt", 'w', encoding="UTF-8") as f:
    for l in TGT:
        print(f"{l[0]}\t\t{l[1]}", file=f)

I then wanted the glosses for these words.

I have a list of glosses extracted from the Abbot-Smith NT Greek lexicon (the list is available here). So I wrote some code to read the output from the previous file, grab the glosses, and add them to the file.

Listing 2

import sys


with open('gloss-dict.tab', 'r', encoding="UTF-8") as f:
  for l in f:
    parts = l.strip().split("\t", maxsplit=2)
    if len(parts) > 1:
        GLOSSES[parts[0]] = parts[1]

ARGS = sys.argv[1:]

with open(ARGS[0], 'r', encoding="UTF-8") as f:
    with open(ARGS[1], 'w', encoding="UTF-8") as g:
        for l in f:
            word = l.strip().split("\t", maxsplit=1)
            if word[0] in GLOSSES:
                rest = "\t".join(word[1:])
                print(f"{word[0]}\t{GLOSSES[word[0]]}\t{rest}", file=g)

I printed the resulting file out and I'm off reading. It's nice to a have cheat sheet of less common vocab for the chapter.