Python nltk: Find collocations without dot-separated words

Tags:

python

nltk

I am trying to find collocations with NLTK in a text by using the built-in method.

Here is my example text (test and foo follow each other, but there is a sentence border in between):

content_part = """test. foo 0 test. foo 1 test. 
               foo 2 test. foo 3 test. foo 4 test. foo 5"""

Result from tokenization and collocations() is as follows:

print(nltk.word_tokenize(content_part))
# ['test.', 'foo', '0', 'test.', 'foo', '1', 'test.',
# 'foo', '2', 'test.', 'foo', '3', 'test.', 'foo', '4', 'test.', 'foo', '5']

nltk.Text(nltk.word_tokenize(content_part)).collocations()
# test. foo

How can I prevent NLTK from:

  1. Including the dot in my tokenization
  2. Finding collocations() across sentence borders?

So in this example it should not print any collocation at all, but I guess you can imagine more complicated texts where there are also collocations within sentences.

I guess that I need to use the Punkt sentence segmenter, but then I do not know how to put the sentences back together to find collocations with NLTK (collocations() seems to be more powerful than just counting things myself).

aufziehvogel asked Feb 05 '12 17:02


1 Answer

You can use WordPunctTokenizer to separate punctuation from words, and then filter out the bigrams that contain punctuation with apply_word_filter().

The same approach works for trigrams, to avoid finding collocations that cross sentence borders.

from nltk import bigrams
from nltk import collocations
from nltk import FreqDist
from nltk.collocations import BigramCollocationFinder
from nltk.tokenize import WordPunctTokenizer

content_part = """test. foo 0 test. foo 1 test. 
               foo 2 test. foo 3 test. foo 4 test, foo 4 test."""

# WordPunctTokenizer splits punctuation into separate tokens: 'test.' -> 'test', '.'
tokens = WordPunctTokenizer().tokenize(content_part)

bigram_measures = collocations.BigramAssocMeasures()
word_fd = FreqDist(tokens)
bigram_fd = FreqDist(bigrams(tokens))
finder = BigramCollocationFinder(word_fd, bigram_fd)

# Drop every bigram that contains a punctuation token
finder.apply_word_filter(lambda w: w in ('.', ','))

scored = finder.score_ngrams(bigram_measures.raw_freq)

print(tokens)
print(sorted(finder.nbest(bigram_measures.raw_freq, 2)))

Output:

['test', '.', 'foo', '0', 'test', '.', 'foo', '1', 'test', '.', 'foo', '2', 'test', '.', 'foo', '3', 'test', '.', 'foo', '4', 'test', ',', 'foo', '4', 'test', '.']
[('4', 'test'), ('foo', '4')]
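If you want to stop bigrams from spanning sentence borders entirely (rather than only filtering out punctuation tokens), one option is to build the finder per sentence. This is a minimal sketch, assuming a recent NLTK: `BigramCollocationFinder.from_documents()` counts n-grams within each token list separately, so no bigram can cross a sentence boundary. A naive `split('.')` stands in for a real sentence segmenter here to keep the example self-contained; in practice you would use `nltk.sent_tokenize` (the Punkt segmenter).

```python
# Sketch: one token list per sentence, so bigrams never cross a border.
from nltk.collocations import BigramCollocationFinder, BigramAssocMeasures
from nltk.tokenize import WordPunctTokenizer

content_part = """test. foo 0 test. foo 1 test.
               foo 2 test. foo 3 test. foo 4 test. foo 5"""

tokenizer = WordPunctTokenizer()
# Naive segmentation for the example; with Punkt: nltk.sent_tokenize(content_part)
sentences = [tokenizer.tokenize(s) for s in content_part.split('.') if s.strip()]

# from_documents() counts bigrams inside each sentence separately
finder = BigramCollocationFinder.from_documents(sentences)
bigram_measures = BigramAssocMeasures()
print(sorted(finder.nbest(bigram_measures.raw_freq, 3)))
```

On this input the cross-border pair ('test', 'foo') never shows up among the scored bigrams at all, which is exactly the behavior the question asks for.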
wishiknew answered Oct 18 '22 22:10