Given a string:
this is a test this is
How can I find the top-n most common 2-grams? In the string above, all 2-grams are:
{this is, is a, a test, test this, this is}
As you can see, the 2-gram this is appears 2 times. Hence the result should be:
{this is: 2}
I know I can use the Counter.most_common() method to find the most common elements, but how can I create a list of 2-grams from the string to begin with?
You can use the method provided in this blog post to conveniently create n-grams in Python.
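from collections import Counter
# words is assumed to be a list of tokens, e.g. ['this', 'is', 'a', ...]
bigrams = zip(words, words[1:])  # pair each word with the one after it
counts = Counter(bigrams)
print(counts.most_common())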
That assumes the input is a list of words, of course. If your input is a string like the one you provided (which has no punctuation), you can simply use words = text.split(' ') to get a list of words. In general, though, you have to take punctuation, whitespace and other non-alphabetic characters into account. In that case you might do something like
import re
words = re.findall(r'[A-Za-z]+', text)
or you could use an external library such as nltk.tokenize.
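Putting the pieces above together, here is a minimal end-to-end sketch for the string from the question. It assumes the simple ASCII-letters regex is good enough for your text; the variable name top is only illustrative.

```python
import re
from collections import Counter

text = "this is a test this is"

# Tokenize: keep runs of ASCII letters only (same regex as above).
words = re.findall(r'[A-Za-z]+', text)

# Pair each word with its successor to form the bigrams.
bigrams = list(zip(words, words[1:]))

# Count the bigrams and keep only the single most common one.
top = Counter(bigrams).most_common(1)
print(top)  # [(('this', 'is'), 2)]
```

Pass a larger number to most_common() to get the top-n instead of the top 1.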
Edit. If you need tri-grams or any other n-grams in general, then you can use the function provided in the blog post I linked to:
def find_ngrams(input_list, n):
    return zip(*(input_list[i:] for i in range(n)))
trigrams = find_ngrams(words, 3)
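To get from find_ngrams to the top-n counts the question asks for, something along these lines should work. The helper top_ngrams and its parameters n and k are hypothetical names introduced here, not part of the blog post; joining each tuple back into a string is just one way to get output shaped like the question's.

```python
from collections import Counter

def find_ngrams(input_list, n):
    # Zip n shifted views of the list; zip stops at the shortest one.
    return zip(*(input_list[i:] for i in range(n)))

def top_ngrams(words, n, k):
    # Hypothetical helper: the k most common n-grams, joined back into strings.
    counts = Counter(" ".join(gram) for gram in find_ngrams(words, n))
    return counts.most_common(k)

words = "this is a test this is".split()
print(top_ngrams(words, 2, 1))  # [('this is', 2)]
print(top_ngrams(words, 3, 2))  # two of the four trigrams, each seen once
```

Note that in Python 3 zip returns a lazy iterator, so wrap the result of find_ngrams in list() if you need to iterate over it more than once.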
Well, you can use
words = s.split() # s is the original string
pairs = [(words[i], words[i+1]) for i in range(len(words)-1)]
(words[i], words[i+1]) is the pair of words at positions i and i+1, and we go over all pairs from (0, 1) to (n-2, n-1), with n being the number of words in s.
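The same index-based idea extends to any n by slicing, and the result can be fed straight into Counter. This is only a sketch; the variable n for the window size is an assumption added here.

```python
from collections import Counter

s = "this is a test this is"
words = s.split()
n = 2  # size of each n-gram; an illustrative choice

# Slide a window of n words over the list; there are len(words) - n + 1 windows.
ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

print(Counter(ngrams).most_common(1))  # [(('this', 'is'), 2)]
```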