A Mind for Language

Dec 10, 2019

Fun with vocab-tools: vocab info for a book

More fun with James Tauber's vocabulary-tools. I'm trying to read the whole NT in Greek, and Titus is next. I started reading it, but there was a lot of unfamiliar vocab, or at least vocab I didn't feel certain of. Vocabulary-tools to the rescue again. Sure, I could buy a reader's Greek New Testament, but where's the fun in that? Besides, using vocabulary-tools lets me customize which words are added to the list.

from collections import Counter
from gnt_data import get_tokens, get_tokens_by_chunk, TokenType, ChunkType
from abott_glosser import Glosser
from ref_tools import get_book

# Get all lemmas in GNT
gnt_lemmas = Counter(get_tokens(TokenType.lemma))

# Get lemmas for the book of Titus
NEW_CHAPTER = Counter(get_tokens(TokenType.lemma, ChunkType.book, get_book("TIT", 60)))

# get GNT-wide freq, rather than freq in the current book
def getNTFreq(nt, tgt):
    out = {}
    for t in tgt.items():
        lemma = t[0]
        if lemma in nt:
            out[lemma] = nt[lemma]
    return out

# map each lemma in the book to its GNT-wide frequency
ACT_NT_FREQ = getNTFreq(gnt_lemmas, NEW_CHAPTER)

# Filter lemmas down to those that occur fewer than LIM times in the GNT as a whole
LIM = 10
freq = lambda x: int(x[1]) < LIM
TGT = sorted(filter(freq, ACT_NT_FREQ.items()), key=lambda x: x[0])

# setup glosser
glosser = Glosser("custom-glosses.tab")

# output results
for l in TGT:
    print(f"{l[0]}\t{l[1]}\t{glosser.get(l[0])}")

By running py get_chapter.py > titus_vocab.txt I now have a vocab list. I can print the list and stick it in my GNT for easy access. In theory I could also keep track of this list and filter those words out when I move on to the next book, or filter out words I have seen only a certain number of times. And by tweaking the print line to print(f"{l[0]}\t{glosser.get(l[0])}"), the file could be imported into Anki and boom! Instant flashcards.
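
Here's a minimal sketch of that "filter out words I've already seen" idea, assuming a plain-text file with one known lemma per line (the filename known_lemmas.txt is hypothetical, not part of vocabulary-tools):

# Hypothetical: known_lemmas.txt lists lemmas I already know, one per line.
try:
    with open("known_lemmas.txt", encoding="UTF-8") as f:
        known = {line.strip() for line in f}
except FileNotFoundError:
    known = set()  # no list yet; keep everything

# print only the lemmas not already on the known list
for lemma, count in TGT:
    if lemma not in known:
        print(f"{lemma}\t{count}\t{glosser.get(lemma)}")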

Nov 22, 2019

Fun with vocab-tools: comparing chapter vocab and glossing it

More fun with James Tauber's vocabulary-tools. So what if you're reading through an NT book chapter by chapter and you wonder what new vocab you're likely to encounter in the next chapter that wasn't in the previous one? Vocabulary-tools can help you figure that out.

Vocabulary-tools doesn't include a glossing tool (as far as I know), but here is a simple one based on a gloss list from the Abbott-Smith NT Greek lexicon (which you can get here).

from greek_normalisation.utils import nfc

class Glosser():
    def __init__(self):
        self.data = dict()
        with open("gloss-dict.tab", 'r', encoding="UTF-8") as f:
            for line in f:
                parts = line.split("\t", maxsplit=1)
                if len(parts) > 1:
                    self.data[nfc(parts[0])] = parts[1]

    def get(self, l):
        normed = nfc(l)
        if normed in self.data:
            return self.data[normed]
        else:
            print(f"{normed} not found in Abott Smith")
            return ''
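
A quick sanity check that the glosser loads (this assumes gloss-dict.tab is in the working directory):

glosser = Glosser()
print(glosser.get("λόγος"))  # should print the Abbott-Smith gloss for λόγος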

Now we can combine that with the following code and run it by typing py analyze_chapter.py <new-cpt-num>. It will print out a list of words that occur fewer than LIM times in the NT, the number of occurrences, and the gloss from Abbott-Smith (if found). I'm currently reading Acts; if you want a different book, you'll need to replace BOOK_ABBV['ACT'] with the book code for the book you want to read. You can figure out this code from the vocabulary-tools module.

from collections import Counter
from gnt_data import get_tokens, get_tokens_by_chunk, TokenType, ChunkType
from abott_glosser import Glosser
import sys

new_cpt = int(sys.argv[1])
BOOK_ABBV = {"GLA": "69", "1JN": "83", "ACT": "65"}  # vocabulary-tools book codes (numbering starts at 61 = Matthew)

# Get all lemmas in GNT
gnt_lemmas = Counter(get_tokens(TokenType.lemma))

# format chapter markers (chapter IDs are the book code plus a two-digit,
# zero-padded chapter number)
last_cpt = str(new_cpt - 1).zfill(2)
this_cpt = str(new_cpt).zfill(2)

# Get lemmas for the current and previous chapters
LAST_CHAPTER = Counter(get_tokens(TokenType.lemma, ChunkType.chapter, BOOK_ABBV['ACT'] + last_cpt))
NEW_CHAPTER = Counter(get_tokens(TokenType.lemma, ChunkType.chapter, BOOK_ABBV['ACT'] + this_cpt))

# get GNT-wide freq, rather than freq in the current chapter
def getNTFreq(nt, tgt):
    out = {}
    for t in tgt.items():
        lemma = t[0]
        if lemma in nt:
            out[lemma] = nt[lemma]
    return out

# subtract vocab seen in the last chapter, then look up GNT-wide frequencies
ACT_NT_FREQ = getNTFreq(gnt_lemmas, NEW_CHAPTER - LAST_CHAPTER)

# Filter lemmas down to those that occur fewer than LIM times in the GNT as a whole
LIM = 10
freq = lambda x: int(x[1]) < LIM
TGT = sorted(filter(freq, ACT_NT_FREQ.items()), key=lambda x: x[0])

print(len(TGT))

# setup glosser
glosser = Glosser()

# output results
for l in TGT:
    print(f"{l[0]}\t{l[1]}\t{glosser.get(l[0])}")

Running py analyze_chapter.py 11 for Acts 11 produced the following output.

21
Κλαύδιος        3       Claudius | C. Lysias
Κυρηναῖος       6       of Cyrene | a Cyrenæan
Κύπριος 3       of Cyprus | Cyprian
Κύπρος  5       Cyprus
Στέφανος        7       Stephen
Ταρσός  3       Tarsus
Φοινίκη not found in Abott Smith
Φοινίκη 3
Χριστιανός      3       a Christian
διασπείρω       3       to scatter abroad, disperse
εὐπορέομαι not found in Abott Smith
εὐπορέομαι      1
καθεξῆς 5       successively | in order | afterwards
προσμένω        7       to wait longer | continue | remain still | to remain with | to remain attached to | cleave unto | abide in
πρώτως  1       first
σημαίνω 6       to give a sign, signify, indicate
ἀναζητέω        3       to look for | seek carefully
ἀνασπάω 2       to draw up
Ἅγαβος not found in Abott Smith
Ἅγαβος  2
ἐκτίθημι        4       to set out, expose | to set forth, expound
Ἑλληνιστής      3       a Hellenist |  Grecian Jew
ἡσυχάζω 5       to be still | to rest from labour | to live quietly | to be silent
ἴσος    8       equal | the same

Nov 19, 2019

More fun with JTauber's vocab tools: Finding verses and pericopes with shared vocab

Vocabulary acquisition requires repeated exposure to a word in order for our brains to acquire that word. In other words, we need to encounter a given word repeatedly to acquire it. Reading texts that cover similar topics is a great way to do this: since the topic is similar, the likelihood of repeated vocabulary between the texts is higher.

For those of us interested in New Testament Greek and acquiring vocabulary, reading the GNT would be a good way to do this. Read the whole thing and you will certainly have acquired a good deal of vocab. But sometimes, biblical texts don't address the same topic with enough repetition for us to naturally get the repeated exposure we need to acquire a word within a short period of time.

What if we could read passages that have a high degree of shared vocab? That should provide the repetition. But how do we find these passages?

Enter the dragon... I mean, enter James Tauber's vocabulary tools for the GNT.

The code

The following code loops through each verse in the GNT and gets the set of all lemmas found there. It then loops through every other verse in the GNT and figures out which lemmas are not common to the two verses. If the number of lemmas that aren't shared is below a given limit (in this case 5), it saves the pair to be output.

from gnt_data import get_tokens, get_tokens_by_chunk, TokenType, ChunkType
from greekutils.verse_ref import bcv_to_verse_ref

# convert a BCV chunk code into a readable reference (book numbering starts at 61 = Matthew)
reffer = lambda x: bcv_to_verse_ref(x, start=61)

gnt_verses = get_tokens_by_chunk(TokenType.lemma, ChunkType.verse)

commons = dict()
LIM = 5
for verse, lemma in gnt_verses.items():
    print(reffer(verse))  # progress indicator
    verse_set = set(lemma)
    for v, l in gnt_verses.items():
        if v == verse:
            continue
        vset = set(l)
        u = verse_set.union(vset)
        intr = verse_set.intersection(vset)
        not_common = u - intr  # lemmas found in one verse but not the other
        if len(not_common) < LIM:
            if verse in commons:
                commons[verse].append(v)
            else:
                commons[verse] = [v]
with open("common_list_verses.txt", 'w') as g:
    for k,v in commons.items():
        print(reffer(k), file=g)
        for i in v:
            print("\t" + reffer(i), file=g)
print("DONE!")

Here's a snippet of the results:

Matt 4:14
    Matt 2:17
    Matt 12:17
    Matt 21:4

Now let's compare them (Greek text taken from [1]):

Matt 4:14 is ἵνα πληρωθῇ τὸ ῥηθὲν διὰ Ἠσαΐου τοῦ προφήτου λέγοντος·

  • Matt 2:17 – τότε ἐπληρώθη τὸ ῥηθὲν ⸀διὰ Ἰερεμίου τοῦ προφήτου λέγοντος
  • Matt 12:17 – ⸀ἵνα πληρωθῇ τὸ ῥηθὲν διὰ Ἠσαΐου τοῦ προφήτου λέγοντος·
  • Matt 21:4 – Τοῦτο ⸀δὲ γέγονεν ἵνα πληρωθῇ τὸ ῥηθὲν διὰ τοῦ προφήτου λέγοντος·

What about larger units of text?

Ok, but who wants to skip around reading random verses? By making a few tweaks to the code above we can compare pericopes.

gnt_verses = get_tokens_by_chunk(TokenType.lemma, ChunkType.pericope)
...

LIM = 10
...
with open("common_list_pericope.txt", 'w') as g:
    for k,v in commons.items():
        print(k, file=g)
        for i in v:
            print("\t" + i, file=g)

This returns the following passages. I had to write some extra code to convert the pericope codes into normal passage references, so you'll want this file and this file if you want to run this part yourself.

Mark 10:13 - Mark 10:16
    Luke 18:15 - Luke 18:17
Luke 18:15 - Luke 18:17
    Mark 10:13 - Mark 10:16
Eph 1:1 - Eph 1:2
    Col 1:1 - Col 1:2
Col 1:1 - Col 1:2
    Eph 1:1 - Eph 1:2

By changing LIM to 15 we get the following list.

Mark 10:13 - Mark 10:16
    Luke 18:15 - Luke 18:17
Luke 18:15 - Luke 18:17
    Mark 10:13 - Mark 10:16
Eph 1:1 - Eph 1:2
    Phil 1:1 - Phil 1:2
    Col 1:1 - Col 1:2
    2 Thess 1:1 - 2 Thess 1:2
    2 Tim 1:1 - 2 Tim 1:2
Phil 1:1 - Phil 1:2
    Eph 1:1 - Eph 1:2
    Col 1:1 - Col 1:2
    2 Thess 1:1 - 2 Thess 1:2
Col 1:1 - Col 1:2
    Eph 1:1 - Eph 1:2
    Phil 1:1 - Phil 1:2
    2 Thess 1:1 - 2 Thess 1:2
    2 Tim 1:1 - 2 Tim 1:2
2 Thess 1:1 - 2 Thess 1:2
    Eph 1:1 - Eph 1:2
    Phil 1:1 - Phil 1:2
    Col 1:1 - Col 1:2
    1 Tim 1:1 - 1 Tim 1:2
    2 Tim 1:1 - 2 Tim 1:2
    Phlm 1:1 - Phlm 1:3
1 Tim 1:1 - 1 Tim 1:2
    2 Thess 1:1 - 2 Thess 1:2
    2 Tim 1:1 - 2 Tim 1:2
2 Tim 1:1 - 2 Tim 1:2
    Eph 1:1 - Eph 1:2
    Col 1:1 - Col 1:2
    2 Thess 1:1 - 2 Thess 1:2
    1 Tim 1:1 - 1 Tim 1:2
Phlm 1:1 - Phlm 1:3
    2 Thess 1:1 - 2 Thess 1:2

κ.τ.λ.

ChunkType could also be changed to chapter if you'd like to compare chapters.
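
For example, the chunking line would become:

gnt_verses = get_tokens_by_chunk(TokenType.lemma, ChunkType.chapter)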

All of the above uses lemmas. If you are interested in forms, then simply replacing TokenType.lemma with TokenType.form in this line will do the trick.

gnt_verses = get_tokens_by_chunk(TokenType.form, ChunkType.pericope)

I doubt this will change your life as a student or a teacher, but it is certainly interesting to know which verses or passages share vocabulary. It could help us develop better reading assignments for students, or point us to passages that would make interesting reading for growing our own vocabulary.


[1]: Michael W. Holmes, The Greek New Testament: SBL Edition (Lexham Press; Society of Biblical Literature, 2011–2013)

Nov 14, 2019

Fun with James Tauber's vocabulary tools

James Tauber has written a set of vocabulary tools for the Greek New Testament (GNT).

I wanted to read Acts 10 and thought I'd see which words occurring there appear fewer than 10 times in the GNT overall. The code in Listing 1 will get that list and print each word and its total GNT count to a text file.

Listing 1

from collections import Counter
from gnt_data import get_tokens, get_tokens_by_chunk, TokenType, ChunkType
import pprint

BOOK_ABBV = {"GLA": "69", "1JN" : "83", "ACT": "65"}

# all lemma tokens in the GNT
gnt_lemmas = Counter(get_tokens(TokenType.lemma))

# lemma tokens in Acts 10 (chunk ID = book code + chapter number)
ACT_10_lemmas = Counter(get_tokens(TokenType.lemma, ChunkType.chapter, BOOK_ABBV['ACT'] + "10"))


def getNTFreq(nt, tgt):
    out = {}
    for t in tgt.items():
        lemma = t[0]
        if lemma in nt:
            out[lemma] = nt[lemma]
    return out

ACT_NT_FREQ = getNTFreq(gnt_lemmas, ACT_10_lemmas)

# keep lemmas occurring fewer than 10 times in the whole GNT
freq = lambda x: int(x[1]) < 10

TGT = sorted(filter(freq, ACT_NT_FREQ.items()), key=lambda x: x[0])

pprint.pprint(TGT)

print(len(TGT))

with open("act_10.txt", 'w', encoding="UTF-8") as f:
    for l in TGT:
        print(f"{l[0]}\t\t{l[1]}", file=f)
print("Done!")

I then wanted the glosses for these words.

I have a list of glosses extracted from the Abbott-Smith NT Greek lexicon (the list is available here). So I wrote some code to read the output file from Listing 1, grab the glosses, and add them to the file.

Listing 2

import sys

GLOSSES = {}

with open('gloss-dict.tab', 'r', encoding="UTF-8") as f:
    for l in f:
        parts = l.strip().split("\t", maxsplit=2)
        if len(parts) > 1:
            GLOSSES[parts[0]] = parts[1]

ARGS = sys.argv[1:]

with open(ARGS[0], 'r', encoding="UTF-8") as f:
    with open(ARGS[1], 'w', encoding="UTF-8") as g:
        for l in f:
            word = l.strip().split("\t", maxsplit=1)
            if word[0] in GLOSSES:
                rest = "\t".join(word[1:])
                print(f"{word[0]}\t{GLOSSES[word[0]]}\t{rest}", file=g)

I printed the resulting file out and I'm off reading. It's nice to have a cheat sheet of less common vocab for the chapter.
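
For example, assuming Listing 2 is saved as add_glosses.py (the name is mine), running py add_glosses.py act_10.txt act_10_glossed.txt writes the glossed list to act_10_glossed.txt.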