A Mind for Language

Dec 10, 2019

Fun with vocab-tools: vocab info for a book

More fun with James Tauber's vocabulary-tools. I'm trying to read the whole NT in Greek, and Titus is next. I started reading it, but there was a lot of unfamiliar vocab, or at least vocab I didn't feel certain of. Vocabulary-tools to the rescue again. Sure, I could buy a reader's Greek New Testament, but where's the fun in that? Also, using vocabulary-tools lets me customize which words are added to the list.

from collections import Counter
from gnt_data import get_tokens, get_tokens_by_chunk, TokenType, ChunkType
from abott_glosser import Glosser
from ref_tools import get_book

# Get all lemmas in GNT
gnt_lemmas = Counter(get_tokens(TokenType.lemma))

# Get lemmas for the whole book (Titus)
NEW_CHAPTER = Counter(get_tokens(TokenType.lemma, ChunkType.book, get_book("TIT", 60)))

# get GNT freq, rather than freq in the current book
def getNTFreq(nt, tgt):
    # map each target lemma to its frequency in the GNT as a whole
    return {lemma: nt[lemma] for lemma in tgt if lemma in nt}

# get the GNT-wide frequencies for the book's lemmas
ACT_NT_FREQ = getNTFreq(gnt_lemmas, NEW_CHAPTER)

# Filter lemmas to those that occur fewer than LIM times in the GNT as a whole
LIM = 10
freq = lambda x: int(x[1]) < LIM
TGT = sorted(list(filter(freq,ACT_NT_FREQ.items())), key=lambda x: x[0])

# setup glosser
glosser = Glosser("custom-glosses.tab")

# output results
for l in TGT:
    print(f"{l[0]}\t{l[1]}\t{glosser.get(l[0])}")

By running py get_chapter.py > titus_vocab.txt, I now have a vocab list. Now I can print the list and stick it in my GNT for easy access. In theory I could also keep track of this list and filter these words out when I move on to the next book, or filter out those that I have only seen a certain number of times; a sketch of that idea follows below. Also, by tweaking the print line to print(f"{l[0]}\t{glosser.get(l[0])}"), the file could be imported into Anki and boom! Instant flashcards.
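
Here is a sketch of that keep-track-and-filter idea. The helper and file name are hypothetical, but the format matches what the script above writes; dropping this in before the output loop would do it:

# Hypothetical helper: collect lemmas (first tab-separated column) from
# previously generated vocab lists such as titus_vocab.txt.
def load_known_lemmas(paths):
    known = set()
    for path in paths:
        with open(path, 'r', encoding="UTF-8") as f:
            for line in f:
                lemma = line.split("\t", maxsplit=1)[0].strip()
                if lemma:
                    known.add(lemma)
    return known

# Drop anything already seen on an earlier list.
KNOWN = load_known_lemmas(["titus_vocab.txt"])
TGT = [t for t in TGT if t[0] not in KNOWN]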

Nov 22, 2019

Fun with vocab-tools: comparing chapter vocab and glossing it

More fun with James Tauber's vocabulary-tools. So what if you're reading through an NT book chapter by chapter and you wonder what new vocab you're likely to encounter in the next chapter that wasn't in the previous one? Vocabulary-tools can help you figure that out.

Vocabulary-tools doesn't include a glossing tool (as far as I know), but here is a simple one based on a gloss list from the Abbott-Smith NT Greek lexicon (which you can get here).

from greek_normalisation.utils import nfc

class Glosser():
    def __init__(self):
        self.data = dict()
        with open("gloss-dict.tab", 'r', encoding="UTF-8") as f:
            for line in f:
                parts = line.split("\t", maxsplit=1)
                if len(parts) > 1:
                    # strip the trailing newline so the gloss prints cleanly
                    self.data[nfc(parts[0])] = parts[1].strip()

    def get(self, l):
        normed = nfc(l)
        if normed in self.data:
            return self.data[normed]
        else:
            print(f"{normed} not found in Abott Smith")
            return ''
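
A quick sanity check of the glosser on its own (λόγος is just an example lemma):

from abott_glosser import Glosser

glosser = Glosser()
print(glosser.get("λόγος"))  # the Abbott-Smith gloss line, or '' if not found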

Now we can combine that with the following code and run it by typing py analyze_chapter.py <new-cpt-num>; it will print out a list of words that occur fewer than LIM times in the NT, the number of occurrences, and the gloss from Abbott-Smith (if found). I'm currently reading Acts; if you want a different book, you'll need to replace BOOK_ABBV['ACT'] with the book code for the book you want to read. You can figure out this code from the vocabulary-tools module.
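
For example, something like this should list the book-level chunk IDs (I'm assuming get_tokens_by_chunk returns a dict keyed by chunk ID, going by its name; check gnt_data if that's off):

from gnt_data import get_tokens_by_chunk, TokenType, ChunkType

# Assuming a dict keyed by chunk ID, the book-level keys are the
# two-digit codes used below ("65" is Acts).
books = get_tokens_by_chunk(TokenType.lemma, ChunkType.book)
print(sorted(books))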

from collections import Counter
from gnt_data import get_tokens, get_tokens_by_chunk, TokenType, ChunkType
from abott_glosser import Glosser
import sys

# chapter number to analyze, from the command line
new_cpt = int(sys.argv[1])

# vocabulary-tools book codes for the books I've been reading
BOOK_ABBV = {"GLA": "69", "1JN" : "83", "ACT": "65"}

# Get all lemmas in GNT
gnt_lemmas = Counter(get_tokens(TokenType.lemma))

# format chapter markers (chunk IDs use two-digit, zero-padded chapter numbers)
last_cpt = f"{new_cpt - 1:02d}"
this_cpt = f"{new_cpt:02d}"

# Get lemmas for the current and previous chapters
LAST_CHAPTER = Counter(get_tokens(TokenType.lemma, ChunkType.chapter, BOOK_ABBV['ACT'] + last_cpt))
NEW_CHAPTER = Counter(get_tokens(TokenType.lemma, ChunkType.chapter, BOOK_ABBV['ACT'] + this_cpt))

# get GNT freq, rather than freq in the current chapter
def getNTFreq(nt, tgt):
    # map each target lemma to its frequency in the GNT as a whole
    return {lemma: nt[lemma] for lemma in tgt if lemma in nt}

# subtract the vocab seen in the last chapter (Counter subtraction keeps only positive counts), then get GNT frequencies
ACT_NT_FREQ = getNTFreq(gnt_lemmas, NEW_CHAPTER - LAST_CHAPTER)

# Filter lemmas to those that occur fewer than LIM times in the GNT as a whole
LIM = 10
freq = lambda x: int(x[1]) < LIM
TGT = sorted(list(filter(freq,ACT_NT_FREQ.items())), key=lambda x: x[0])

# print how many words made the list
print(len(TGT))

# setup glosser
glosser = Glosser()

# output results
for l in TGT:
    print(f"{l[0]}\t{l[1]}\t{glosser.get(l[0])}")

Running py analyze_chapter.py 11 on Acts 11 produced the following output (the leading 21 is the word count):

21
Κλαύδιος        3       Claudius | C. Lysias
Κυρηναῖος       6       of Cyrene | a Cyrenæan
Κύπριος 3       of Cyprus | Cyprian
Κύπρος  5       Cyprus
Στέφανος        7       Stephen
Ταρσός  3       Tarsus
Φοινίκη not found in Abbott-Smith
Φοινίκη 3
Χριστιανός      3       a Christian
διασπείρω       3       to scatter abroad, disperse
εὐπορέομαι not found in Abbott-Smith
εὐπορέομαι      1
καθεξῆς 5       successively | in order | afterwards
προσμένω        7       to wait longer | continue | remain still | to remain with | to remain attached to | cleave unto | abide in
πρώτως  1       first
σημαίνω 6       to give a sign, signify, indicate
ἀναζητέω        3       to look for | seek carefully
ἀνασπάω 2       to draw up
Ἅγαβος not found in Abbott-Smith
Ἅγαβος  2
ἐκτίθημι        4       to set out, expose | to set forth, expound
Ἑλληνιστής      3       a Hellenist | Grecian Jew
ἡσυχάζω 5       to be still | to rest from labour | to live quietly | to be silent
ἴσος    8       equal | the same

Nov 14, 2019

Fun with James Tauber's vocabulary tools

James Tauber has written a set of vocabulary tools for the Greek New Testament (GNT).

I wanted to read Acts 10 and thought I'd see which words occur there that occur fewer than 10 times in the GNT overall. The code in Listing 1 will get that list and print each word and its total GNT count to a text file.

Listing 1

from collections import Counter
from gnt_data import get_tokens, get_tokens_by_chunk, TokenType, ChunkType
import pprint

BOOK_ABBV = {"GLA": "69", "1JN" : "83", "ACT": "65"}

# all lemmas in the GNT, with counts
gnt_lemmas = Counter(get_tokens(TokenType.lemma))

# lemmas in Acts 10 ("65" is Acts)
ACT_10_lemmas = Counter(get_tokens(TokenType.lemma, ChunkType.chapter, BOOK_ABBV['ACT'] + "10"))


# map each target lemma to its frequency in the GNT as a whole
def getNTFreq(nt, tgt):
    return {lemma: nt[lemma] for lemma in tgt if lemma in nt}

ACT_NT_FREQ = getNTFreq(gnt_lemmas, ACT_10_lemmas)

# keep only lemmas occurring fewer than 10 times in the whole GNT
freq = lambda x: int(x[1]) < 10

TGT = sorted(list(filter(freq, ACT_NT_FREQ.items())), key=lambda x: x[0])

pprint.pprint(TGT)

print(len(TGT))

with open("act_10.txt", 'w', encoding="UTF-8") as f:
    for l in TGT:
        print(f"{l[0]}\t\t{l[1]}", file=f)
print("Done!")

I then wanted the glosses for these words.

I have a list of glosses extracted from the Abbott-Smith NT Greek lexicon (the list is available here). So I wrote some code to read the output file from Listing 1 (one lemma, a double tab, and the GNT count per line), grab the glosses, and write them into a new copy of the file.

Listing 2

import sys

GLOSSES = {}

# load the Abbott-Smith gloss list into a dict keyed by lemma
with open('gloss-dict.tab', 'r', encoding="UTF-8") as f:
    for l in f:
        parts = l.strip().split("\t", maxsplit=2)
        if len(parts) > 1:
            GLOSSES[parts[0]] = parts[1]

ARGS = sys.argv[1:]

# read the word list, splice the gloss in after the lemma, and write it all out
with open(ARGS[0], 'r', encoding="UTF-8") as f:
    with open(ARGS[1], 'w', encoding="UTF-8") as g:
        for l in f:
            word = l.strip().split("\t", maxsplit=1)
            if word[0] in GLOSSES:
                rest = "\t".join(word[1:])
                print(f"{word[0]}\t{GLOSSES[word[0]]}\t{rest}", file=g)
            else:
                # keep words with no gloss rather than dropping them
                print(l.strip(), file=g)

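To produce the glossed file, pass the input and output names as arguments, e.g. py add_glosses.py act_10.txt act_10_glossed.txt (the script and output file names here are just illustrative).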
I printed the resulting file out and I'm off reading. It's nice to have a cheat sheet of less common vocab for the chapter.