Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to find collocations in text, python

How do you find collocations in text? A collocation is a sequence of words that occurs together unusually often. python has built-in func bigrams that returns word pairs.

>>> bigrams(['more', 'is', 'said', 'than', 'done'])
[('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]
>>>

What's left is to find bigrams that occur more often based on the frequency of individual words. Any ideas how to put it in the code?

like image 894
Gusto Avatar asked Nov 08 '10 22:11

Gusto


1 Answers

Try NLTK. You will mostly be interested in nltk.collocations.BigramCollocationFinder, but here is a quick demonstration to show you how to get started:

>>> import nltk
>>> def tokenize(sentences):
...     for sent in nltk.sent_tokenize(sentences.lower()):
...         for word in nltk.word_tokenize(sent):
...             yield word
... 

>>> nltk.Text(tkn for tkn in tokenize('mary had a little lamb.'))
<Text: mary had a little lamb ....>
>>> text = nltk.Text(tkn for tkn in tokenize('mary had a little lamb.'))

There are none in this small segment, but here goes:

>>> text.collocations(num=20)
Building collocations list
like image 144
Tim McNamara Avatar answered Oct 04 '22 04:10

Tim McNamara