Fast n-gram calculation

I'm using NLTK to search for n-grams in a corpus but it's taking a very long time in some cases. I've noticed calculating n-grams isn't an uncommon feature in other packages (apparently Haystack has some functionality for it). Does this mean there's a potentially faster way of finding n-grams in my corpus if I abandon NLTK? If so, what can I use to speed things up?
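
For context, this is roughly the kind of NLTK call I mean (a minimal sketch, with a toy token list standing in for my much larger corpus):

    from nltk import ngrams

    tokens = "the quick brown fox jumps over the lazy dog".split()
    bigrams = list(ngrams(tokens, 2))   # [('the', 'quick'), ('quick', 'brown'), ...]
    trigrams = list(ngrams(tokens, 3))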

asked Sep 29 '11 by Trindaz

People also ask

How do you calculate n-grams?

With a small corpus we may count only one occurrence of each n-gram. Dividing these counts by the total number of n-grams in our list would give each n-gram a probability of 0.5 of occurring. Looking at a larger corpus of words shows what the probabilities can tell us.
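
A minimal sketch of that count-and-divide calculation (the toy corpus below is purely illustrative):

    from collections import Counter

    tokens = "I like tea I like coffee".split()
    bigrams = list(zip(tokens, tokens[1:]))            # every bigram in the toy corpus
    counts = Counter(bigrams)
    total = len(bigrams)                               # size of the full bigram list

    # probability of an n-gram = its count divided by the total number of n-grams
    probs = {bg: c / total for bg, c in counts.items()}
    print(probs[("I", "like")])                        # 2 / 5 = 0.4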

What is n-gram frequency?

From Glottopedia: the mean, or summed, frequency of all fragments of a word of a given length. Most commonly used is bigram frequency, which uses fragments of length 2.
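
As a rough sketch of that definition, the summed character-bigram frequency of a word could be computed like this (the frequency table below is made up for illustration):

    bigram_freq = {"th": 1200, "he": 900, "er": 700, "re": 650}   # made-up corpus counts

    def summed_bigram_frequency(word):
        # all fragments of length 2, e.g. "there" -> "th", "he", "er", "re"
        fragments = [word[i:i + 2] for i in range(len(word) - 1)]
        return sum(bigram_freq.get(f, 0) for f in fragments)

    print(summed_bigram_frequency("there"))   # 1200 + 900 + 700 + 650 = 3450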

What is n-gram model with example?

An N-gram means a sequence of N words. So for example, “Medium blog” is a 2-gram (a bigram), “A Medium blog post” is a 4-gram, and “Write on Medium” is a 3-gram (trigram).

How does bigram work?

The bigram model rests on the assumption that the probability of a word depends only on the previous word, also known as the Markov assumption. Markov models are the class of probabilistic models that assume we can predict the probability of some future unit without looking too far into the past.
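
A minimal sketch of that bigram estimate, P(word | previous word) = count(previous, word) / count(previous), on a toy corpus:

    from collections import Counter

    tokens = "I am Sam Sam I am I do not like green eggs".split()
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    unigram_counts = Counter(tokens)

    def bigram_prob(prev, word):
        # Markov assumption: condition only on the single previous word
        return bigram_counts[(prev, word)] / unigram_counts[prev]

    print(bigram_prob("I", "am"))   # 2 occurrences of ('I', 'am') / 3 occurrences of 'I'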


2 Answers

Since you didn't indicate whether you want word-level or character-level n-grams, I'm just going to assume the former, without loss of generality.

I also assume you start with a list of tokens, represented by strings. You can easily write the n-gram extraction yourself:

def ngrams(tokens, MIN_N, MAX_N):
    """Yield every n-gram of tokens with MIN_N <= length <= MAX_N, as a token slice."""
    n_tokens = len(tokens)
    for i in range(n_tokens):
        for j in range(i + MIN_N, min(n_tokens, i + MAX_N) + 1):
            yield tokens[i:j]

Then replace the yield with the actual action you want to take on each n-gram (add it to a dict, store it in a database, whatever) to get rid of the generator overhead.

Finally, if it's really not fast enough, convert the above to Cython and compile it. Example using a defaultdict instead of yield:

from collections import defaultdict

def ngrams(tokens, int MIN_N, int MAX_N):
    cdef Py_ssize_t i, j, n_tokens

    count = defaultdict(int)
    join_spaces = " ".join   # bind the method once, outside the hot loop

    n_tokens = len(tokens)
    for i in range(n_tokens):
        for j in range(i + MIN_N, min(n_tokens, i + MAX_N) + 1):
            count[join_spaces(tokens[i:j])] += 1

    return count
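
Assuming the Cython version above is saved as ngrams.pyx (the file name is just for illustration), it can be compiled and used roughly like this:

    import pyximport; pyximport.install()   # compiles .pyx modules on import
    from ngrams import ngrams               # assumes the code above lives in ngrams.pyx

    tokens = "the quick brown fox jumps over the lazy dog".split()
    counts = ngrams(tokens, 1, 3)           # all unigrams, bigrams and trigrams
    print(counts["quick brown"])            # 1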
answered by Fred Foo


You might find this Pythonic, elegant and fast n-gram generation function, built with zip and the splat (*) operator, useful:

def find_ngrams(input_list, n):
    # zip n shifted copies of the list; each resulting tuple is one n-gram
    return zip(*[input_list[i:] for i in range(n)])
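
For example, on a toy token list (note that in Python 3 zip returns an iterator, so wrap it in list() if you want the n-grams materialized):

    tokens = "the quick brown fox".split()
    print(list(find_ngrams(tokens, 2)))
    # [('the', 'quick'), ('quick', 'brown'), ('brown', 'fox')]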
answered by Wxds