I'm using NLTK to search for n-grams in a corpus but it's taking a very long time in some cases. I've noticed calculating n-grams isn't an uncommon feature in other packages (apparently Haystack has some functionality for it). Does this mean there's a potentially faster way of finding n-grams in my corpus if I abandon NLTK? If so, what can I use to speed things up?
Calculating n-gram probability: with this small corpus we count only one occurrence of each n-gram. Dividing these counts by the total number of n-grams in our list gives each n-gram a probability of 0.5 of occurring. Let's look at a larger corpus of words and see what the probabilities can tell us.
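As a rough sketch of that calculation (the toy corpus and numbers below are illustrative, not the ones referred to above), relative-frequency estimates can be computed like this:
from collections import Counter

# Toy example: estimate n-gram probabilities as relative frequencies
# over all n-grams extracted from a small corpus.
tokens = "the cat sat on the mat".split()
n = 2
bigrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
counts = Counter(bigrams)
total = sum(counts.values())
probs = {gram: c / total for gram, c in counts.items()}
print(probs)  # each of the 5 bigrams occurs once, so each gets probability 0.2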
From Glottopedia: the mean (or summed) frequency of all fragments of a word of a given length. Most commonly used is bigram frequency, which uses fragments of length 2.
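A minimal sketch of that definition, assuming you already have a table of corpus bigram frequencies (the table below is made up purely for illustration):
# Hypothetical corpus frequencies for a few character bigrams.
corpus_freq = {"th": 5000, "he": 4500, "er": 3000, "re": 2500}

def word_bigram_frequency(word, freqs):
    # Mean corpus frequency of the word's length-2 fragments.
    fragments = [word[i:i + 2] for i in range(len(word) - 1)]
    return sum(freqs.get(f, 0) for f in fragments) / len(fragments)

print(word_bigram_frequency("there", corpus_freq))  # (5000 + 4500 + 3000 + 2500) / 4 = 3750.0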
An N-gram means a sequence of N words. So for example, “Medium blog” is a 2-gram (a bigram), “A Medium blog post” is a 4-gram, and “Write on Medium” is a 3-gram (trigram).
The Bigram Model: this assumption that the probability of a word depends only on the previous word is also known as the Markov assumption. Markov models are the class of probabilistic models that assume we can predict the probability of some future unit without looking too far into the past.
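To make that concrete, here is a small sketch (toy corpus, maximum-likelihood counts) of estimating a bigram probability P(word | previous word):
from collections import Counter

# Estimate P(word | previous word) by maximum likelihood on a toy corpus.
tokens = "the cat sat on the mat the cat slept".split()
unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))

def bigram_prob(prev, word):
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(bigram_prob("the", "cat"))  # "the" occurs 3 times and is followed by "cat" twice: ~0.67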
Since you didn't indicate whether you want word or character-level n-grams, I'm just going to assume the former, without loss of generality.
I also assume you start with a list of tokens, represented by strings. What you can easily do is write n-gram extraction yourself.
def ngrams(tokens, MIN_N, MAX_N):
    # Yield every n-gram (as a slice of the token list) whose length
    # is between MIN_N and MAX_N inclusive.
    n_tokens = len(tokens)
    for i in range(n_tokens):
        for j in range(i + MIN_N, min(n_tokens, i + MAX_N) + 1):
            yield tokens[i:j]
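For example (a usage sketch, not part of the original answer):
tokens = "a medium blog post".split()
for gram in ngrams(tokens, 2, 3):
    print(gram)
# ['a', 'medium'], ['a', 'medium', 'blog'], ['medium', 'blog'], ... down to ['blog', 'post']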
Then replace the yield with the actual action you want to take on each n-gram (add it to a dict, store it in a database, whatever) to get rid of the generator overhead.
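For instance, the pure-Python version of that change, counting each n-gram in a defaultdict (a sketch along those lines, not the answer's own code):
from collections import defaultdict

def ngram_counts(tokens, MIN_N, MAX_N):
    # Same double loop as above, but accumulate counts instead of yielding.
    count = defaultdict(int)
    n_tokens = len(tokens)
    for i in range(n_tokens):
        for j in range(i + MIN_N, min(n_tokens, i + MAX_N) + 1):
            count[" ".join(tokens[i:j])] += 1
    return count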
Finally, if it's really not fast enough, convert the above to Cython and compile it. Example using a defaultdict instead of yield:
from collections import defaultdict

def ngrams(tokens, int MIN_N, int MAX_N):
    # Count every n-gram of length MIN_N..MAX_N, keyed by the
    # space-joined token sequence.
    cdef Py_ssize_t i, j, n_tokens

    count = defaultdict(int)
    join_spaces = " ".join

    n_tokens = len(tokens)
    for i in range(n_tokens):
        for j in range(i + MIN_N, min(n_tokens, i + MAX_N) + 1):
            count[join_spaces(tokens[i:j])] += 1
    return count
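To compile it, save the code in a .pyx file and build it, for example with a minimal setup.py like the sketch below (the module name ngrams.pyx is just illustrative):
from setuptools import setup
from Cython.Build import cythonize

# Build with: python setup.py build_ext --inplace
setup(ext_modules=cythonize("ngrams.pyx"))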
You might find a pythonic, elegant and fast ngram generation function using zip and the splat (*) operator here:
def find_ngrams(input_list, n):
    # zip together n staggered copies of the list; each tuple is one n-gram
    return zip(*[input_list[i:] for i in range(n)])
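For example (in Python 3, zip returns an iterator, so wrap it in list() to materialise the n-grams):
tokens = "a medium blog post".split()
print(list(find_ngrams(tokens, 2)))
# [('a', 'medium'), ('medium', 'blog'), ('blog', 'post')]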