Forming Bigrams of words in list of sentences with Python

Tags: python, list, list-comprehension, nltk, collocation

I have a list of sentences:

text = ['cant railway station', 'citadel hotel', ' police stn']

I need to form bigrams (pairs of adjacent words) and store them in a variable. The problem is that when I do that, I get pairs of sentences instead of pairs of words. Here is what I did:

import nltk

text2 = [[word for word in line.split()] for line in text]
bigrams = nltk.bigrams(text2)
print(bigrams)

which yields

[(['cant', 'railway', 'station'], ['citadel', 'hotel']), (['citadel', 'hotel'], ['police', 'stn'])]

Here the whole sentences 'cant railway station' and 'citadel hotel' form a single bigram. What I want is

[('cant', 'railway'), ('railway', 'station'), ('citadel', 'hotel'), and so on...

The last word of the first sentence should not be paired with the first word of the second sentence. What should I do to make it work?

asked Feb 18 '14 04:02

Hypothetical Ninja

2 Answers

Using list comprehensions and zip:

>>> text = ["this is a sentence", "so is this one"]
>>> bigrams = [b for l in text for b in zip(l.split(" ")[:-1], l.split(" ")[1:])]
>>> print(bigrams)
[('this', 'is'), ('is', 'a'), ('a', 'sentence'), ('so', 'is'), ('is', 'this'), ('this', 'one')]
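One caveat when applying this to the list from the question: ' police stn' starts with a space, so split(" ") produces an empty first token and therefore a spurious ('', 'police') pair. Below is a minimal sketch that avoids this, assuming whitespace-separated words are good enough as tokens. The helper name sentence_bigrams is made up for illustration.

```python
def sentence_bigrams(sentences):
    """Return word bigrams per sentence, never pairing words across sentences."""
    pairs = []
    for sentence in sentences:
        # split() with no argument splits on any whitespace and drops empty strings
        words = sentence.split()
        # pair each word with its successor inside this sentence only
        pairs.extend(zip(words, words[1:]))
    return pairs


text = ['cant railway station', 'citadel hotel', ' police stn']
bigrams = sentence_bigrams(text)
print(bigrams)
# [('cant', 'railway'), ('railway', 'station'), ('citadel', 'hotel'), ('police', 'stn')]
```

Since the pairs are built one sentence at a time, 'station' is never paired with 'citadel'.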

answered Oct 03 '22 01:10

butch

from nltk import word_tokenize
from nltk.util import ngrams

text = ['cant railway station', 'citadel hotel', 'police stn']
bigrams = []
for line in text:
    tokens = word_tokenize(line)  # requires NLTK's tokenizer data to be downloaded
    # the 2 is the n-gram size; change it to get n-grams of a different size
    bigrams.extend(ngrams(tokens, 2))
print(bigrams)
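If NLTK isn't available, the same per-sentence idea can be sketched in plain Python. This is not how nltk.util.ngrams is implemented, just a stand-in under two assumptions: whitespace tokenization instead of word_tokenize, and n of at least 1. The helper name sentence_ngrams is hypothetical.

```python
def sentence_ngrams(sentences, n=2):
    """Collect n-grams sentence by sentence so they never span two sentences."""
    result = []
    for sentence in sentences:
        tokens = sentence.split()
        # zip over n shifted views of the token list;
        # a sentence with fewer than n tokens contributes nothing
        result.extend(zip(*(tokens[i:] for i in range(n))))
    return result


text = ['cant railway station', 'citadel hotel', 'police stn']
print(sentence_ngrams(text, 2))
# [('cant', 'railway'), ('railway', 'station'), ('citadel', 'hotel'), ('police', 'stn')]
print(sentence_ngrams(text, 3))
# [('cant', 'railway', 'station')]
```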

answered Oct 03 '22 00:10

gurinder
