
I want to make a dictionary of trigrams out of a text file, but something is wrong and I do not know what it is

I have written a program that counts the trigrams occurring 5 times or more in a text file. The trigrams should be printed in order of frequency.

I cannot find the problem!

I get the following error message:

list index out of range

I have tried to make the range bigger, but that did not work.

f = open("bsp_file.txt", encoding="utf-8")
text = f.read()
f.close()



words = []

for word in text.split():
    word = word.strip(",.:;-?!-–—_ ")
    if len(word) != 0:
        words.append(word)

trigrams = {}
for i in range(len(words)):
    word = words[i]
    nextword = words[i + 1]
    nextnextword = words[i + 2]
    key = (word, nextword, nextnextword)
    trigrams[key] = trigrams.get(key, 0) + 1   

l = list(trigrams.items())
l.sort(key=lambda x: x[1])
l.reverse()

for key, count in l:
    if count < 5:
        break
    word = key[0]
    nextword = key[1]
    nextnextword = key[2]
    print(word, nextword, nextnextword, count)

The result should look like this (simplified):

s = "this is a trigram which is an example............."
this is a
is a trigram
a trigram which
trigram which is
which is an
is an example
besch150 asked Nov 30 '25 00:11



1 Answer

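As the comments pointed out, you iterate over the list words with the index i and access words[i + 1] and words[i + 2]. Once i reaches the second-to-last index, i + 2 is already past the end of the list, so Python raises "list index out of range". Making the range bigger only makes this worse: the loop has to stop two elements before the end of the list.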
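
If you want to keep the index-based loop from the question, a minimal fix is to shorten the range by two so that words[i + 2] always exists. The sample words list below is only an assumption for illustration; in the question it is built from the file.

```python
# Sample input for illustration only; in the question, words comes from bsp_file.txt.
words = "this is a trigram which is an example".split()

trigrams = {}
# The last valid start index is len(words) - 3, hence range(len(words) - 2).
# If the text has fewer than 3 words, the range is empty and nothing is counted.
for i in range(len(words) - 2):
    key = (words[i], words[i + 1], words[i + 2])
    trigrams[key] = trigrams.get(key, 0) + 1

for key in trigrams:
    print(*key)
```

The rest of the original program (sorting and printing) should work unchanged with this loop.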

I suggest you read this tutorial to generate n-grams with pure python: http://www.albertauyeung.com/post/generating-ngrams-python/



If you don't have much time to read it all, here's the function I recommend, adapted from the link:

def get_ngrams_count(words, n):
    # lazily yields tuples representing all n-grams (zip stops at the shortest slice)
    ngrams_tuple = zip(*[words[i:] for i in range(n)])
    # turn the list into a dictionary with the counts of all ngrams
    ngrams_count = {}
    for ngram in ngrams_tuple:
        if ngram not in ngrams_count:
            ngrams_count[ngram] = 0
        ngrams_count[ngram] += 1
    return ngrams_count

trigrams = get_ngrams_count(words, 3)
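
To see why the zip expression cannot run out of range, it may help to look at the intermediate values on a tiny input. This is only an illustrative sketch; the names shifted and trigram_list are made up for the example.

```python
words = ["this", "is", "a", "trigram"]

# One copy of the list per position in the n-gram, each shifted by one more word.
shifted = [words[i:] for i in range(3)]
# shifted[0] starts at "this", shifted[1] at "is", shifted[2] at "a"

# zip pairs up the elements at the same position and stops at the shortest
# list, so it never tries to read past the end of words.
trigram_list = list(zip(*shifted))

for trigram in trigram_list:
    print(trigram)
```

Because zip stops as soon as the shortest slice is exhausted, a list with fewer than n words simply produces no n-grams instead of raising an error.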

Please note that you can make this function a lot simpler by using a Counter (which subclasses dict, so it will be compatible with your code):

from collections import Counter


def get_ngrams_count(words, n):
    # turn the list into a dictionary with the counts of all ngrams
    return Counter(zip(*[words[i:] for i in range(n)]))

trigrams = get_ngrams_count(words, 3)
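
A Counter also offers most_common(), which returns (item, count) pairs ordered from most to least frequent, so the sorting and the "5 times or more" filter could be sketched as below. The helper frequent_ngrams and its min_count parameter are hypothetical names introduced only for this example, and the sample text is made up.

```python
from collections import Counter


def get_ngrams_count(words, n):
    # count every n-gram produced by zipping n shifted copies of the list
    return Counter(zip(*[words[i:] for i in range(n)]))


def frequent_ngrams(words, n, min_count):
    # hypothetical helper: n-grams seen at least min_count times,
    # ordered from most common to least common
    counts = get_ngrams_count(words, n)
    return [(ngram, count) for ngram, count in counts.most_common()
            if count >= min_count]


# Made-up sample; with the real file you would pass words and min_count=5.
sample = "a b c a b c a b c".split()
for ngram, count in frequent_ngrams(sample, 3, 3):
    print(" ".join(ngram), count)
```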

Side Notes

  • You can pass reverse=True to .sort() to sort your list from most common to least common:
l = list(trigrams.items())
l.sort(key=lambda x: x[1], reverse=True)

This is a tad faster than sorting the list in ascending order and then reversing it with .reverse().

  • A more generic way to print your sorted list (it works for any n-grams, not just trigrams):
for ngram, count in l:
    if count < 5:
        break
    # " ".join(ngram) will combine all elements of ngram in a string, separated with spaces
    print(" ".join(ngram), count)
Inspi answered Dec 01 '25 17:12



