The essential concepts in text mining is n-grams, which are a set of co-occurring or continuous sequence of n items from a sequence of large text or sentence. The item here could be words, letters, and syllables. 1-gram is also called as unigrams are the unique words present in the sentence.
N-grams are continuous sequences of words or symbols or tokens in a document. In technical terms, they can be defined as the neighbouring sequences of items in a document. They come into play when we deal with text data in NLP(Natural Language Processing) tasks.
An N-gram means a sequence of N words. So for example, “Medium blog” is a 2-gram (a bigram), “A Medium blog post” is a 4-gram, and “Write on Medium” is a 3-gram (trigram). Well, that wasn't very interesting or exciting. True, but we still have to look at the probability used with n-grams, which is quite interesting.
Great native python based answers given by other users. But here's the nltk
approach (just in case, the OP gets penalized for reinventing what's already existing in the nltk
library).
There is an ngram module that people seldom use in nltk
. It's not because it's hard to read ngrams, but training a model base on ngrams where n > 3 will result in much data sparsity.
from nltk import ngrams
sentence = 'this is a foo bar sentences and i want to ngramize it'
n = 6
sixgrams = ngrams(sentence.split(), n)
for grams in sixgrams:
print grams
I'm surprised that this hasn't shown up yet:
In [34]: sentence = "I really like python, it's pretty awesome.".split()
In [35]: N = 4
In [36]: grams = [sentence[i:i+N] for i in xrange(len(sentence)-N+1)]
In [37]: for gram in grams: print gram
['I', 'really', 'like', 'python,']
['really', 'like', 'python,', "it's"]
['like', 'python,', "it's", 'pretty']
['python,', "it's", 'pretty', 'awesome.']
Using only nltk tools
from nltk.tokenize import word_tokenize
from nltk.util import ngrams
def get_ngrams(text, n ):
n_grams = ngrams(word_tokenize(text), n)
return [ ' '.join(grams) for grams in n_grams]
Example output
get_ngrams('This is the simplest text i could think of', 3 )
['This is the', 'is the simplest', 'the simplest text', 'simplest text i', 'text i could', 'i could think', 'could think of']
In order to keep the ngrams in array format just remove ' '.join
People have already answered pretty nicely for the scenario where you need bigrams or trigrams but if you need everygram for the sentence in that case you can use nltk.util.everygrams
>>> from nltk.util import everygrams
>>> message = "who let the dogs out"
>>> msg_split = message.split()
>>> list(everygrams(msg_split))
[('who',), ('let',), ('the',), ('dogs',), ('out',), ('who', 'let'), ('let', 'the'), ('the', 'dogs'), ('dogs', 'out'), ('who', 'let', 'the'), ('let', 'the', 'dogs'), ('the', 'dogs', 'out'), ('who', 'let', 'the', 'dogs'), ('let', 'the', 'dogs', 'out'), ('who', 'let', 'the', 'dogs', 'out')]
Incase you have a limit like in case of trigrams where the max length should be 3 then you can use max_len param to specify it.
>>> list(everygrams(msg_split, max_len=2))
[('who',), ('let',), ('the',), ('dogs',), ('out',), ('who', 'let'), ('let', 'the'), ('the', 'dogs'), ('dogs', 'out')]
You can just modify the max_len param to achieve whatever gram i.e four gram, five gram, six or even hundred gram.
The previous mentioned solutions can be modified to implement the above mentioned solution but this solution is much straight forward than that.
For further reading click here
And when you just need a specific gram like bigram or trigram etc you can use the nltk.util.ngrams as mentioned in M.A.Hassan's answer.
here is another simple way for do n-grams
>>> from nltk.util import ngrams
>>> text = "I am aware that nltk only offers bigrams and trigrams, but is there a way to split my text in four-grams, five-grams or even hundred-grams"
>>> tokenize = nltk.word_tokenize(text)
>>> tokenize
['I', 'am', 'aware', 'that', 'nltk', 'only', 'offers', 'bigrams', 'and', 'trigrams', ',', 'but', 'is', 'there', 'a', 'way', 'to', 'split', 'my', 'text', 'in', 'four-grams', ',', 'five-grams', 'or', 'even', 'hundred-grams']
>>> bigrams = ngrams(tokenize,2)
>>> bigrams
[('I', 'am'), ('am', 'aware'), ('aware', 'that'), ('that', 'nltk'), ('nltk', 'only'), ('only', 'offers'), ('offers', 'bigrams'), ('bigrams', 'and'), ('and', 'trigrams'), ('trigrams', ','), (',', 'but'), ('but', 'is'), ('is', 'there'), ('there', 'a'), ('a', 'way'), ('way', 'to'), ('to', 'split'), ('split', 'my'), ('my', 'text'), ('text', 'in'), ('in', 'four-grams'), ('four-grams', ','), (',', 'five-grams'), ('five-grams', 'or'), ('or', 'even'), ('even', 'hundred-grams')]
>>> trigrams=ngrams(tokenize,3)
>>> trigrams
[('I', 'am', 'aware'), ('am', 'aware', 'that'), ('aware', 'that', 'nltk'), ('that', 'nltk', 'only'), ('nltk', 'only', 'offers'), ('only', 'offers', 'bigrams'), ('offers', 'bigrams', 'and'), ('bigrams', 'and', 'trigrams'), ('and', 'trigrams', ','), ('trigrams', ',', 'but'), (',', 'but', 'is'), ('but', 'is', 'there'), ('is', 'there', 'a'), ('there', 'a', 'way'), ('a', 'way', 'to'), ('way', 'to', 'split'), ('to', 'split', 'my'), ('split', 'my', 'text'), ('my', 'text', 'in'), ('text', 'in', 'four-grams'), ('in', 'four-grams', ','), ('four-grams', ',', 'five-grams'), (',', 'five-grams', 'or'), ('five-grams', 'or', 'even'), ('or', 'even', 'hundred-grams')]
>>> fourgrams=ngrams(tokenize,4)
>>> fourgrams
[('I', 'am', 'aware', 'that'), ('am', 'aware', 'that', 'nltk'), ('aware', 'that', 'nltk', 'only'), ('that', 'nltk', 'only', 'offers'), ('nltk', 'only', 'offers', 'bigrams'), ('only', 'offers', 'bigrams', 'and'), ('offers', 'bigrams', 'and', 'trigrams'), ('bigrams', 'and', 'trigrams', ','), ('and', 'trigrams', ',', 'but'), ('trigrams', ',', 'but', 'is'), (',', 'but', 'is', 'there'), ('but', 'is', 'there', 'a'), ('is', 'there', 'a', 'way'), ('there', 'a', 'way', 'to'), ('a', 'way', 'to', 'split'), ('way', 'to', 'split', 'my'), ('to', 'split', 'my', 'text'), ('split', 'my', 'text', 'in'), ('my', 'text', 'in', 'four-grams'), ('text', 'in', 'four-grams', ','), ('in', 'four-grams', ',', 'five-grams'), ('four-grams', ',', 'five-grams', 'or'), (',', 'five-grams', 'or', 'even'), ('five-grams', 'or', 'even', 'hundred-grams')]
You can easily whip up your own function to do this using itertools
:
from itertools import izip, islice, tee
s = 'spam and eggs'
N = 3
trigrams = izip(*(islice(seq, index, None) for index, seq in enumerate(tee(s, N))))
list(trigrams)
# [('s', 'p', 'a'), ('p', 'a', 'm'), ('a', 'm', ' '),
# ('m', ' ', 'a'), (' ', 'a', 'n'), ('a', 'n', 'd'),
# ('n', 'd', ' '), ('d', ' ', 'e'), (' ', 'e', 'g'),
# ('e', 'g', 'g'), ('g', 'g', 's')]
A more elegant approach to build bigrams with python’s builtin zip()
.
Simply convert the original string into a list by split()
, then pass the list once normally and once offset by one element.
string = "I really like python, it's pretty awesome."
def find_bigrams(s):
input_list = s.split(" ")
return zip(input_list, input_list[1:])
def find_ngrams(s, n):
input_list = s.split(" ")
return zip(*[input_list[i:] for i in range(n)])
find_bigrams(string)
[('I', 'really'), ('really', 'like'), ('like', 'python,'), ('python,', "it's"), ("it's", 'pretty'), ('pretty', 'awesome.')]
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With