I have the following code. I know that I can use the apply_freq_filter function to filter out collocations that occur fewer times than a given frequency count. However, I don't know how to get the frequencies of all the n-gram tuples (in my case, bigrams) in a document before I decide what frequency threshold to set for filtering. As you can see, I am using the NLTK collocations class.
import nltk
from nltk.collocations import *

# Read the file into a single string
line = ""
open_file = open('a_text_file', 'r')
for val in open_file:
    line += val

tokens = line.split()

bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(3)
print(finder.nbest(bigram_measures.pmi, 100))
A frequency distribution records the number of times each outcome of an experiment has occurred. For example, a frequency distribution could be used to record the frequency of each word type in a document. Frequency distributions are encoded by the FreqDist class, which is defined by the nltk.probability module.
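For instance, here is a minimal sketch (using a made-up sentence) of how FreqDist counts word types:

from nltk import FreqDist

# A made-up sentence; FreqDist counts each outcome (here, word tokens)
tokens = "the quick brown fox jumps over the lazy dog and the fox".split()
fdist = FreqDist(tokens)

print(fdist['the'])           # frequency of the word type 'the' -> 3
print(fdist.most_common(2))   # the two most frequent word types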
NLTK comes with its own bigrams generator, as well as a convenient FreqDist() function.
import nltk

f = open('a_text_file')
raw = f.read()
tokens = nltk.word_tokenize(raw)

# Create your bigrams
bgs = nltk.bigrams(tokens)

# Compute the frequency distribution for all the bigrams in the text
fdist = nltk.FreqDist(bgs)
for k, v in fdist.items():
    print(k, v)
Once you have access to the bigrams and their frequency distribution, you can filter according to your needs.
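For example, continuing from the fdist built above, you could preview which bigrams would survive a given cutoff before passing that value to apply_freq_filter (the cutoff of 3 below is only an illustration, not a recommendation):

threshold = 3  # illustrative value; pick one after inspecting the counts
frequent = [(bg, count) for bg, count in fdist.items() if count >= threshold]
for bg, count in sorted(frequent, key=lambda pair: pair[1], reverse=True):
    print(bg, count)

Alternatively, fdist.most_common(20) gives a quick look at the twenty most frequent bigrams.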
Hope that helps.