I am trying to find collocations in a text with NLTK's built-in method.
Here is my example text (test and foo follow each other, but there is a sentence border between them):
content_part = """test. foo 0 test. foo 1 test.
foo 2 test. foo 3 test. foo 4 test. foo 5"""
The result of tokenization and collocations() is as follows:
import nltk
print(nltk.word_tokenize(content_part))
# ['test.', 'foo', '0', 'test.', 'foo', '1', 'test.',
# 'foo', '2', 'test.', 'foo', '3', 'test.', 'foo', '4', 'test.', 'foo', '5']
# collocations() prints its result itself, so no print is needed here
nltk.Text(nltk.word_tokenize(content_part)).collocations()
# test. foo
How can I prevent NLTK from finding collocations across sentence borders?
So in this example it should not print any collocation at all, but I guess you can imagine more complicated texts where there are also collocations within sentences.
I guess I need to use the Punkt sentence segmenter, but then I do not know how to put the sentences back together to find collocations with NLTK (collocations() seems to be more powerful than just counting things myself).
You could use WordPunctTokenizer to separate punctuation from words, and then filter out the bigrams that contain punctuation with apply_word_filter().
The same approach works for trigrams, so that no collocations are found across sentence borders.
from nltk import FreqDist, bigrams
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder
from nltk.tokenize import WordPunctTokenizer
content_part = """test. foo 0 test. foo 1 test.
foo 2 test. foo 3 test. foo 4 test, foo 4 test."""
# Split punctuation into separate tokens so it can be filtered out below
tokens = WordPunctTokenizer().tokenize(content_part)
bigram_measures = BigramAssocMeasures()
word_fd = FreqDist(tokens)
bigram_fd = FreqDist(bigrams(tokens))
finder = BigramCollocationFinder(word_fd, bigram_fd)
# Drop every bigram that contains a punctuation token
finder.apply_word_filter(lambda w: w in ('.', ','))
print(tokens)
print(sorted(finder.nbest(bigram_measures.raw_freq, 2), reverse=True))
Output:
['test', '.', 'foo', '0', 'test', '.', 'foo', '1', 'test', '.', 'foo', '2', 'test', '.', 'foo', '3', 'test', '.', 'foo', '4', 'test', ',', 'foo', '4', 'test', '.']
[('foo', '4'), ('4', 'test')]
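For trigrams, the same filter can be applied to a TrigramCollocationFinder. The following is only a minimal sketch that reuses the example text from above; the variable names are illustrative. It assumes that every sentence border is marked by one of the punctuation tokens listed in the filter, so a border without such a token would not be caught.

```python
from nltk.collocations import TrigramAssocMeasures, TrigramCollocationFinder
from nltk.tokenize import WordPunctTokenizer

content_part = """test. foo 0 test. foo 1 test.
foo 2 test. foo 3 test. foo 4 test, foo 4 test."""

# Split punctuation into separate tokens, as in the bigram example
tokens = WordPunctTokenizer().tokenize(content_part)

# Build the finder directly from the token sequence
finder = TrigramCollocationFinder.from_words(tokens)

# Drop every trigram that contains a punctuation token
# (assumption: '.' and ',' are the only border markers in this text)
finder.apply_word_filter(lambda w: w in ('.', ','))

trigram_measures = TrigramAssocMeasures()
# All remaining trigrams with their scores, and the single most frequent one
scored_trigrams = finder.score_ngrams(trigram_measures.raw_freq)
top_trigrams = finder.nbest(trigram_measures.raw_freq, 1)
print(top_trigrams)
```

Only trigrams of the form ('foo', <number>, 'test') survive the filter here, because every other window of three tokens contains a '.' or ','.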