I'm using NLTK to search for n-grams in a corpus but it's taking a very long time in some cases. I've noticed calculating n-grams isn't an uncommon feature in other packages (apparently Haystack has some functionality for it). Does this mean there's a potentially faster way of finding n-grams in my corpus if I abandon NLTK? If so, what can I use to speed things up?
Calculating n-gram probability: with this small corpus we count only one occurrence of each n-gram. Dividing these counts by the total number of n-grams in our list gives each n-gram a probability of 0.5 of occurring. Let's look at a larger corpus of words and see what the probabilities can tell us.
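As a rough sketch of that calculation (the toy corpus and numbers below are illustrative, not the ones referred to above), relative-frequency estimates can be computed like this:
from collections import Counter

# Toy example: estimate n-gram probabilities as relative frequencies
# over all n-grams extracted from a small corpus.
tokens = "the cat sat on the mat".split()
n = 2
bigrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
counts = Counter(bigrams)
total = sum(counts.values())
probs = {gram: c / total for gram, c in counts.items()}
print(probs)  # each of the 5 bigrams occurs once, so each gets probability 0.2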
From Glottopedia: the mean (or summed) frequency of all fragments of a word of a given length. Most commonly used is bigram frequency, which uses fragments of length 2.
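A minimal sketch of that definition, assuming you already have a table of corpus bigram frequencies (the table below is made up purely for illustration):
# Hypothetical corpus frequencies for a few character bigrams.
corpus_freq = {"th": 5000, "he": 4500, "er": 3000, "re": 2500}

def word_bigram_frequency(word, freqs):
    # Mean corpus frequency of the word's length-2 fragments.
    fragments = [word[i:i + 2] for i in range(len(word) - 1)]
    return sum(freqs.get(f, 0) for f in fragments) / len(fragments)

print(word_bigram_frequency("there", corpus_freq))  # (5000 + 4500 + 3000 + 2500) / 4 = 3750.0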
An N-gram means a sequence of N words. So for example, “Medium blog” is a 2-gram (a bigram), “A Medium blog post” is a 4-gram, and “Write on Medium” is a 3-gram (trigram).
The Bigram Model: this assumption that the probability of a word depends only on the previous word is also known as the Markov assumption. Markov models are the class of probabilistic models that assume we can predict the probability of some future unit without looking too far into the past.
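To make that concrete, here is a small sketch (toy corpus, maximum-likelihood counts) of estimating a bigram probability P(word | previous word):
from collections import Counter

# Estimate P(word | previous word) by maximum likelihood on a toy corpus.
tokens = "the cat sat on the mat the cat slept".split()
unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))

def bigram_prob(prev, word):
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(bigram_prob("the", "cat"))  # "the" occurs 3 times and is followed by "cat" twice: ~0.67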
Since you didn't indicate whether you want word or character-level n-grams, I'm just going to assume the former, without loss of generality.
I also assume you start with a list of tokens, represented by strings. What you can easily do is write n-gram extraction yourself.
def ngrams(tokens, MIN_N, MAX_N):
    # Yield every n-gram (as a slice of the token list) whose length
    # is between MIN_N and MAX_N inclusive.
    n_tokens = len(tokens)
    for i in range(n_tokens):
        for j in range(i + MIN_N, min(n_tokens, i + MAX_N) + 1):
            yield tokens[i:j]
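For example (a usage sketch, not part of the original answer):
tokens = "a medium blog post".split()
for gram in ngrams(tokens, 2, 3):
    print(gram)
# ['a', 'medium'], ['a', 'medium', 'blog'], ['medium', 'blog'], ... down to ['blog', 'post']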
Then replace the yield with the actual action you want to take on each n-gram (add it to a dict, store it in a database, whatever) to get rid of the generator overhead.
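For instance, the pure-Python version of that change, counting each n-gram in a defaultdict (a sketch along those lines, not the answer's own code):
from collections import defaultdict

def ngram_counts(tokens, MIN_N, MAX_N):
    # Same double loop as above, but accumulate counts instead of yielding.
    count = defaultdict(int)
    n_tokens = len(tokens)
    for i in range(n_tokens):
        for j in range(i + MIN_N, min(n_tokens, i + MAX_N) + 1):
            count[" ".join(tokens[i:j])] += 1
    return count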
Finally, if it's really not fast enough, convert the above to Cython and compile it. Example using a defaultdict instead of yield:
from collections import defaultdict

def ngrams(tokens, int MIN_N, int MAX_N):
    # Count every n-gram of length MIN_N..MAX_N, keyed by the
    # space-joined token sequence.
    cdef Py_ssize_t i, j, n_tokens

    count = defaultdict(int)
    join_spaces = " ".join

    n_tokens = len(tokens)
    for i in range(n_tokens):
        for j in range(i + MIN_N, min(n_tokens, i + MAX_N) + 1):
            count[join_spaces(tokens[i:j])] += 1
    return count
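To compile it, save the code in a .pyx file and build it, for example with a minimal setup.py like the sketch below (the module name ngrams.pyx is just illustrative):
from setuptools import setup
from Cython.Build import cythonize

# Build with: python setup.py build_ext --inplace
setup(ext_modules=cythonize("ngrams.pyx"))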
You might find a pythonic, elegant and fast ngram generation function using zip and the splat (*) operator here:
def find_ngrams(input_list, n):
    # zip together n staggered copies of the list; each tuple is one n-gram
    return zip(*[input_list[i:] for i in range(n)])
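For example (in Python 3, zip returns an iterator, so wrap it in list() to materialise the n-grams):
tokens = "a medium blog post".split()
print(list(find_ngrams(tokens, 2)))
# [('a', 'medium'), ('medium', 'blog'), ('blog', 'post')]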