# Mean, median distance between lemma clusters

Fletcher Hardison

I was reading a Koine Greek text last night and I realized that I was able to get a handle on certain new vocabulary because it was frequent within the section I was reading. I have no idea if it is frequent within the work as a whole, but the distance between encounters with the word was pretty short in the section I was reading. This got me wondering if this idea of distance would be helpful as a metric for vocabulary teaching/learning.

I wrote some code using the gnt_data from JTauber’s vocabulary-tools. It’s available in this gist.

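The two tables below show rows 25 through 50 of the output. The first is sorted by mean and the second by median. Looking at the output, it seems that the mean distance follows overall frequency (total), which makes sense: the mean gap is just the span from a word's first occurrence to its last divided by the number of gaps, so the more often a word occurs, the smaller its mean tends to be. That is, however, not the case for the median distance.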

Notice that the median does not follow frequency at all. Among words with similar medians, the total number of occurrences and the mean vary widely. The lower the median, the higher the probability that a word occurs multiple times close together. So we might expect a word like θηρίον to be rare within the corpus as a whole but common in certain sections.
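To make the two metrics concrete, here is a minimal sketch of how the gaps could be computed. It is not the code from the gist and it does not use vocabulary-tools: it assumes the text is already a flat list of lemmas in reading order, and the function name `gap_stats` and the toy data are made up for illustration.

```python
from collections import defaultdict
from statistics import mean, median


def gap_stats(lemmas):
    """Return total count, mean gap and median gap for each repeated lemma.

    `lemmas` is assumed to be a list of lemma strings in reading order.
    A "gap" is the difference in token position between two consecutive
    occurrences of the same lemma.
    """
    positions = defaultdict(list)
    for index, lemma in enumerate(lemmas):
        positions[lemma].append(index)

    stats = {}
    for lemma, indexes in positions.items():
        if len(indexes) < 2:
            continue  # a word that occurs once has no gaps to measure
        gaps = [later - earlier for earlier, later in zip(indexes, indexes[1:])]
        stats[lemma] = {
            "total": len(indexes),
            "mean": mean(gaps),
            "median": median(gaps),
        }
    return stats


# Toy data (invented): θηρίον appears three times close together,
# then once more much later.
toy_lemmas = [
    "θηρίον", "ὁ", "θηρίον", "ὁ", "θηρίον",
    "ὁ", "ὁ", "ὁ", "ὁ", "ὁ", "ὁ", "ὁ", "θηρίον",
]

stats = gap_stats(toy_lemmas)
for lemma, row in sorted(stats.items(), key=lambda item: item[1]["median"]):
    print(lemma, row)
```

In the toy list the gaps for θηρίον are 2, 2 and 8, so one long gap pulls the mean up to 4 while the median stays at 2. That is the clustering effect described above.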

Just for kicks, here is the median data again, this time showing rows 0 through 25.

In the end, is this data particularly useful? Probably not. It could be helpful as a starting point to see if a word tends to cluster. If it does, then it should be possible to use vocabulary-tools to find the pericopes, paragraphs, chapters, etc. where the word clusters. This could be helpful in finding readings, so that we can use a familiar corpus such as the GNT to teach words that are rare within that corpus but more common within the wider Greek corpus. Of course this could all be hogwash. Hopefully, I can write the code to find pericopes where the words cluster.
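As a rough idea of what that follow-up might look like, the sketch below ranks sections by how dense a target lemma is within them. It is only an assumption about the approach: the input is a hypothetical list of `(section, lemma)` pairs, the section labels "A" and "B" are invented, and nothing here calls vocabulary-tools.

```python
from collections import Counter


def sections_by_density(tagged_lemmas, target):
    """Rank sections by the share of their tokens that are `target`.

    `tagged_lemmas` is assumed to be an iterable of (section, lemma)
    pairs, where `section` could be a pericope, paragraph or chapter label.
    Returns (section, hits, density) tuples, densest section first.
    """
    totals = Counter()  # tokens per section
    hits = Counter()    # occurrences of the target per section
    for section, lemma in tagged_lemmas:
        totals[section] += 1
        if lemma == target:
            hits[section] += 1

    ranked = [
        (section, count, count / totals[section])
        for section, count in hits.items()
    ]
    ranked.sort(key=lambda row: row[2], reverse=True)
    return ranked


# Toy data (invented): the word is twice as dense in section "A".
toy_tagged = [
    ("A", "θηρίον"), ("A", "ὁ"), ("A", "θηρίον"), ("A", "ὁ"),
    ("B", "ὁ"), ("B", "ὁ"), ("B", "ὁ"), ("B", "θηρίον"),
]

ranked = sections_by_density(toy_tagged, "θηρίον")
for section, count, density in ranked:
    print(section, count, density)
```

Density is used instead of a raw count so that long sections do not win just by being long. A raw count, or the median gap within each section, would be reasonable alternatives.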