
Forming Bigrams of words in list of sentences with Python

I have a list of sentences:

text = ['cant railway station','citadel hotel',' police stn'].  

I need to form bigram pairs and store them in a variable. The problem is that when I do that, I get a pair of sentences instead of words. Here is what I did:

text2 = [[word for word in line.split()] for line in text]
bigrams = nltk.bigrams(text2)
print(bigrams)

which yields

[(['cant', 'railway', 'station'], ['citadel', 'hotel']), (['citadel', 'hotel'], ['police', 'stn'])]

Here the whole sentences 'cant railway station' and 'citadel hotel' form one bigram. What I want is

[(['cant'], ['railway']), (['railway'], ['station']), (['citadel'], ['hotel']), and so on...

The last word of the first sentence should not merge with the first word of second sentence. What should I do to make it work?

Hypothetical Ninja asked Feb 18 '14 04:02



2 Answers

Using list comprehensions and zip:

>>> text = ["this is a sentence", "so is this one"]
>>> bigrams = [b for l in text for b in zip(l.split(" ")[:-1], l.split(" ")[1:])]
>>> print(bigrams)
[('this', 'is'), ('is', 'a'), ('a', 'sentence'), ('so', 'is'), ('is', 'this'), ('this', 'one')]
butch answered Oct 03 '22 01:10
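
The zip approach above can also be wrapped in a small helper. This is only a sketch under one assumption: the name `sentence_bigrams` is made up for illustration and does not come from the answer. It uses `split()` with no argument rather than `split(" ")`, because the question's list contains `' police stn'` with a leading space, and `split(" ")` would produce an empty-string token for that entry.

```python
def sentence_bigrams(sentences):
    """Return word bigrams per sentence, never pairing words across sentences."""
    pairs = []
    for sentence in sentences:
        # split() with no argument splits on any run of whitespace,
        # so leading or doubled spaces do not create empty tokens
        words = sentence.split()
        # zip stops at the shorter sequence, so a one-word sentence adds nothing
        pairs.extend(zip(words, words[1:]))
    return pairs


text = ['cant railway station', 'citadel hotel', ' police stn']
bigrams = sentence_bigrams(text)
print(bigrams)
# [('cant', 'railway'), ('railway', 'station'), ('citadel', 'hotel'), ('police', 'stn')]
```

Because each sentence is processed on its own, 'station' is never paired with 'citadel'.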


from nltk import word_tokenize
from nltk.util import ngrams

text = ['cant railway station', 'citadel hotel', 'police stn']
for line in text:
    # word_tokenize is imported directly, so it is called without the nltk. prefix;
    # it needs NLTK's tokenizer data to be downloaded first
    tokens = word_tokenize(line)
    # the 2 means bigrams; change it to get n-grams of a different size
    bigrams = list(ngrams(tokens, 2))
    print(bigrams)
gurinder answered Oct 03 '22 00:10
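
To see what the size argument does without installing NLTK, the same idea can be sketched in plain Python. `plain_ngrams` is a hypothetical helper written for this illustration and is not part of NLTK. It assumes whitespace tokenization via `split()`, which is cruder than `word_tokenize`.

```python
def plain_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    # Build n shifted views of the token list and zip them together;
    # zip stops at the shortest view, so inputs shorter than n give []
    return list(zip(*(tokens[i:] for i in range(n))))


text = ['cant railway station', 'citadel hotel', 'police stn']

# One list of bigrams per sentence, so pairs never cross sentence boundaries
per_sentence = [plain_ngrams(line.split(), 2) for line in text]
print(per_sentence)

# Changing the size to 3 yields trigrams; two-word sentences then produce nothing
trigrams = [plain_ngrams(line.split(), 3) for line in text]
print(trigrams)
```

Keeping one list per sentence, instead of flattening, makes it easy to trace each pair back to the sentence it came from.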