counting n-gram frequency in python nltk

I have the following code. I know that I can use the apply_freq_filter function to filter out collocations that occur fewer than a given number of times. However, I don't know how to get the frequencies of all the n-gram tuples (bigrams, in my case) in a document before I decide which frequency threshold to use for filtering. As you can see, I am using the nltk collocations classes.

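    import nltk
    from nltk.collocations import BigramCollocationFinder

    # Read the whole file and split it into whitespace-delimited tokens
    with open('a_text_file', 'r') as open_file:
        line = open_file.read()
    tokens = line.split()

    bigram_measures = nltk.collocations.BigramAssocMeasures()
    finder = BigramCollocationFinder.from_words(tokens)
    # Ignore bigrams that occur fewer than 3 times
    finder.apply_freq_filter(3)
    # Print the 100 best remaining bigrams, ranked by PMI
    print(finder.nbest(bigram_measures.pmi, 100))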
Rkz asked Jan 16 '13 18:01



1 Answer

NLTK comes with its own bigrams generator, as well as a convenient FreqDist class for counting.

    import nltk

    with open('a_text_file') as f:
        raw = f.read()

    # word_tokenize needs NLTK's Punkt tokenizer data to be downloaded
    tokens = nltk.word_tokenize(raw)

    # Create your bigrams
    bgs = nltk.bigrams(tokens)

    # Compute the frequency distribution for all the bigrams in the text
    fdist = nltk.FreqDist(bgs)
    for k, v in fdist.items():
        print(k, v)

Once you have the bigrams and their frequency distribution, you can inspect the counts and filter according to your needs.
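As a rough sketch of that filtering step, the snippet below counts bigrams with the standard library's collections.Counter instead of NLTK. In NLTK 3, nltk.FreqDist is a subclass of Counter, so the same most_common and items calls should work on fdist as well. The sample sentence and the helper names bigram_counts and filter_by_min_count are made up for illustration and are not part of NLTK.

```python
from collections import Counter


def bigram_counts(tokens):
    """Count adjacent token pairs (bigrams) in a list of tokens."""
    return Counter(zip(tokens, tokens[1:]))


def filter_by_min_count(counts, min_count):
    """Keep only the entries that occur at least min_count times."""
    return {gram: n for gram, n in counts.items() if n >= min_count}


# Hypothetical sample text, used only for illustration
tokens = "the cat sat on the mat and the cat ran".split()
counts = bigram_counts(tokens)

# Look at the most frequent bigrams first to help choose a threshold
for gram, n in counts.most_common(3):
    print(gram, n)

# Then keep only the bigrams that meet the chosen threshold
frequent = filter_by_min_count(counts, 2)
print(frequent)
```

Looking at the top of the distribution before filtering gives a feel for which cutoff is sensible; that cutoff can then be passed to apply_freq_filter on the collocation finder.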

Hope that helps.

Ram Narasimhan answered Oct 06 '22 00:10