
Generate bigrams with NLTK

I am trying to produce a bigram list for a given sentence. For example, if I type

    To be or not to be

I want the program to generate

     to be, be or, or not, not to, to be

I tried the following code, but it just gives me

    <generator object bigrams at 0x0000000009231360>

This is my code:

    import nltk
    bigrm = nltk.bigrams(text)
    print(bigrm)

So how do I get what I want? I want a list of combinations of the words like above (to be, be or, or not, not to, to be).

Nikhil Raghavendra asked Jun 06 '16 06:06



3 Answers

nltk.bigrams() returns an iterator (specifically, a generator) of bigrams. If you want a list, pass the iterator to list(). It also expects a sequence of items to generate bigrams from, so you have to split the text before passing it in (if you have not already done so):

    bigrm = list(nltk.bigrams(text.split()))
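The sample input in the question is capitalized ("To be or not to be") while the desired output is all lowercase, so it may help to lowercase the text before splitting. The following is a minimal sketch of that step; it pairs tokens with plain `zip` as a stand-in so that it runs without NLTK installed, and is not a call into the library itself.

```python
text = "To be or not to be"

# Lowercase first so "To" and "to" become the same token, then split on whitespace.
tokens = text.lower().split()

# Pair each token with its successor. This is a plain-Python stand-in for
# the bigram step, used here only for illustration.
bigrm = list(zip(tokens, tokens[1:]))

print(bigrm)
```

With NLTK available, the same `tokens` list can be passed to nltk.bigrams() as shown above.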

To print them separated by commas, you could do this (in Python 3):

    print(*map(' '.join, bigrm), sep=', ')

On Python 2, for example:

    print ', '.join(' '.join((a, b)) for a, b in bigrm)

Note that if you only want to print the bigrams, you do not need to build a list; just use the iterator directly.
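One caveat when using the iterator directly is that a generator can be consumed only once. The sketch below illustrates this with a small stand-in generator, `adjacent_pairs` (a hypothetical helper, not part of NLTK), so that it runs without NLTK installed. Since nltk.bigrams() also returns a generator, the same behavior should be expected there.

```python
def adjacent_pairs(tokens):
    """Yield adjacent token pairs one at a time (hypothetical stand-in for nltk.bigrams)."""
    for left, right in zip(tokens, tokens[1:]):
        yield (left, right)

tokens = "to be or not to be".split()

gen = adjacent_pairs(tokens)
first_pass = ', '.join(' '.join(pair) for pair in gen)
second_pass = ', '.join(' '.join(pair) for pair in gen)  # the generator is already exhausted

print(first_pass)
print(repr(second_pass))

# Materialize a list when the bigrams are needed more than once.
as_list = list(adjacent_pairs(tokens))
```

If the bigrams are only printed once, the iterator is enough; if they are reused later (counted, filtered, printed again), converting to a list first avoids the empty second pass.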

Ilja Everilä answered Sep 21 '22 16:09



The following code produces a bigram list for a given sentence:

    >>> import nltk
    >>> from nltk.tokenize import word_tokenize
    >>> text = "to be or not to be"
    >>> tokens = word_tokenize(text)  # use the function imported above
    >>> bigrm = nltk.bigrams(tokens)
    >>> print(*map(' '.join, bigrm), sep=', ')
    to be, be or, or not, not to, to be

Ashok Kumar Jayaraman answered Sep 22 '22 16:09



Quite late, but this is another way.

    >>> from nltk.util import ngrams
    >>> text = "I am batman and I like coffee"
    >>> _1gram = text.split(" ")
    >>> _2gram = [' '.join(e) for e in ngrams(_1gram, 2)]
    >>> _3gram = [' '.join(e) for e in ngrams(_1gram, 3)]
    >>>
    >>> _1gram
    ['I', 'am', 'batman', 'and', 'I', 'like', 'coffee']
    >>> _2gram
    ['I am', 'am batman', 'batman and', 'and I', 'I like', 'like coffee']
    >>> _3gram
    ['I am batman', 'am batman and', 'batman and I', 'and I like', 'I like coffee']
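
For readers who want to see what the sliding window amounts to, here is a rough plain-Python sketch. `simple_ngrams` is a hypothetical helper written for illustration; it is not NLTK's implementation and leaves out options such as padding that nltk.util.ngrams supports.

```python
def simple_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple (illustrative helper only)."""
    tokens = list(tokens)
    # There are len(tokens) - n + 1 windows; range() is empty when n exceeds the length.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

words = "I am batman and I like coffee".split(" ")

two_grams = [' '.join(g) for g in simple_ngrams(words, 2)]
three_grams = [' '.join(g) for g in simple_ngrams(words, 3)]

print(two_grams)
print(three_grams)
```

For this sentence the sketch yields the same strings as the `_2gram` and `_3gram` lists above.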

Shashwat answered Sep 22 '22 16:09
