Python nltk: Find collocations without dot-separated words

Tags:

python

nltk

I am trying to find collocations with NLTK in a text by using the built-in method.

Here is my example text (test and foo follow each other, but there is a sentence border in between):

content_part = """test. foo 0 test. foo 1 test. 
               foo 2 test. foo 3 test. foo 4 test. foo 5"""

Result from tokenization and collocations() is as follows:

print(nltk.word_tokenize(content_part))
# ['test.', 'foo', '0', 'test.', 'foo', '1', 'test.',
# 'foo', '2', 'test.', 'foo', '3', 'test.', 'foo', '4', 'test.', 'foo', '5']

nltk.Text(nltk.word_tokenize(content_part)).collocations()
# test. foo

How can I prevent NLTK from:

  1. Including the dot in my tokenization
  2. Finding collocations() across sentence borders?

So in this example it should not print any collocation at all, but I guess you can imagine more complicated texts where there are also collocations within sentences.

I guess that I need to use the Punkt sentence segmenter, but then I do not know how to put the sentences back together to find collocations with NLTK (collocations() seems to be more powerful than just counting things myself).

aufziehvogel asked Feb 05 '12 17:02


1 Answer

You can use WordPunctTokenizer to separate punctuation from words, and then filter out the bigrams that contain punctuation with apply_word_filter().

The same approach works for trigrams, to avoid finding collocations that cross sentence borders.

from nltk import bigrams
from nltk import collocations
from nltk import FreqDist
from nltk.collocations import BigramCollocationFinder
from nltk.tokenize import WordPunctTokenizer

content_part = """test. foo 0 test. foo 1 test. 
               foo 2 test. foo 3 test. foo 4 test, foo 4 test."""

# WordPunctTokenizer splits punctuation into separate tokens: 'test.' -> 'test', '.'
tokens = WordPunctTokenizer().tokenize(content_part)

bigram_measures = collocations.BigramAssocMeasures()
word_fd = FreqDist(tokens)
bigram_fd = FreqDist(bigrams(tokens))
finder = BigramCollocationFinder(word_fd, bigram_fd)

# Drop every bigram that contains a punctuation token
finder.apply_word_filter(lambda w: w in ('.', ','))

scored = finder.score_ngrams(bigram_measures.raw_freq)

print(tokens)
print(sorted(finder.nbest(bigram_measures.raw_freq, 2)))

Output:

['test', '.', 'foo', '0', 'test', '.', 'foo', '1', 'test', '.', 'foo', '2', 'test', '.', 'foo', '3', 'test', '.', 'foo', '4', 'test', ',', 'foo', '4', 'test', '.']
[('4', 'test'), ('foo', '4')]
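If you want to stop bigrams from spanning sentence borders entirely (rather than only filtering out punctuation tokens), one option is to build the finder per sentence. This is a minimal sketch, assuming a recent NLTK: `BigramCollocationFinder.from_documents()` counts n-grams within each token list separately, so no bigram can cross a sentence boundary. A naive `split('.')` stands in for a real sentence segmenter here to keep the example self-contained; in practice you would use `nltk.sent_tokenize` (the Punkt segmenter).

```python
# Sketch: one token list per sentence, so bigrams never cross a border.
from nltk.collocations import BigramCollocationFinder, BigramAssocMeasures
from nltk.tokenize import WordPunctTokenizer

content_part = """test. foo 0 test. foo 1 test.
               foo 2 test. foo 3 test. foo 4 test. foo 5"""

tokenizer = WordPunctTokenizer()
# Naive segmentation for the example; with Punkt: nltk.sent_tokenize(content_part)
sentences = [tokenizer.tokenize(s) for s in content_part.split('.') if s.strip()]

# from_documents() counts bigrams inside each sentence separately
finder = BigramCollocationFinder.from_documents(sentences)
bigram_measures = BigramAssocMeasures()
print(sorted(finder.nbest(bigram_measures.raw_freq, 3)))
```

On this input the cross-border pair ('test', 'foo') never shows up among the scored bigrams at all, which is exactly the behavior the question asks for.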
wishiknew answered Oct 18 '22 22:10