 

Is there a more efficient way to find the most common n-grams?

I'm trying to find the k most common n-grams in a large corpus. I've seen many places suggest the naïve approach: scan through the entire corpus while keeping a dictionary of counts for every n-gram. Is there a better way to do this?
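For reference, the naïve single-pass approach described above looks roughly like this (a minimal sketch; 'corpus.txt' is a placeholder path):

>>> tokens = open('corpus.txt').read().split()
>>> counts = {}
>>> for gram in zip(tokens, tokens[1:]):   # bigrams; use n staggered slices for general n
...     counts[gram] = counts.get(gram, 0) + 1
>>> sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:10]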

asked Feb 21 '17 by bendl



1 Answer

In Python, using NLTK:

$ wget http://norvig.com/big.txt
$ python
>>> from collections import Counter
>>> from nltk import ngrams
>>> bigtxt = open('big.txt').read()
>>> ngram_counts = Counter(ngrams(bigtxt.split(), 2))
>>> ngram_counts.most_common(10)
[(('of', 'the'), 12422), (('in', 'the'), 5741), (('to', 'the'), 4333), (('and', 'the'), 3065), (('on', 'the'), 2214), (('at', 'the'), 1915), (('by', 'the'), 1863), (('from', 'the'), 1754), (('of', 'a'), 1700), (('with', 'the'), 1656)]
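Note that Counter.most_common(k) uses a heap (heapq.nlargest) internally, so extracting the top k of N distinct n-grams costs O(N log k) rather than a full O(N log N) sort.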

In pure Python, with no dependencies (see Fast/Optimize N-gram implementations in python):

>>> import collections
>>> def ngrams(text, n=2):
...     # Slide n staggered views over the token list and zip them together.
...     return zip(*[text[i:] for i in range(n)])
...
>>> ngram_counts = collections.Counter(ngrams(bigtxt.split(), 2))
>>> ngram_counts.most_common(10)
[(('of', 'the'), 12422), (('in', 'the'), 5741), (('to', 'the'), 4333), (('and', 'the'), 3065), (('on', 'the'), 2214), (('at', 'the'), 1915), (('by', 'the'), 1863), (('from', 'the'), 1754), (('of', 'a'), 1700), (('with', 'the'), 1656)]
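Both versions above read the whole corpus into memory. If the corpus is too big for that, the same counting works streaming, one line at a time; a rough sketch, under the assumption that n-grams should not cross line boundaries:

>>> def stream_ngram_counts(path, n=2):
...     # Memory stays bounded by the number of distinct n-grams,
...     # not by the corpus size.
...     counts = collections.Counter()
...     with open(path) as fin:
...         for line in fin:
...             tokens = line.split()
...             counts.update(zip(*[tokens[i:] for i in range(n)]))
...     return counts
...
>>> stream_ngram_counts('big.txt', 2).most_common(10)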

In Julia (see Generate ngrams with Julia):

# On Julia >= 1.0: readstring was removed, and the old Iterators.jl
# package's partition(xs, n, step) now lives in IterTools.jl.
using StatsBase: countmap
using IterTools: partition
bigtxt = read("big.txt", String)
ngram_counts = countmap(collect(partition(split(bigtxt), 2, 1)))

Rough timing:

$ time python ngram-test.py # With NLTK.

real    0m3.166s
user    0m2.274s
sys 0m0.528s

$ time python ngram-native-test.py 

real    0m1.521s
user    0m1.317s
sys 0m0.145s

$ time julia ngram-test.jl 

real    0m3.573s
user    0m3.188s
sys 0m0.306s
answered Nov 16 '22 by alvas