I have a series of text items: raw HTML from a MySQL database. I want to find the most common phrases in these entries (not just the single most common phrase, and ideally without enforcing word-for-word matching).
My example is any review page on Yelp.com, which shows 3 snippets from hundreds of reviews of a given restaurant, in the format:
"Try the hamburger" (in 44 reviews)
e.g., the "Review Highlights" section of this page:
http://www.yelp.com/biz/sushi-gen-los-angeles/
I have NLTK installed and I've played around with it a bit, but am honestly overwhelmed by the options. This seems like a rather common problem and I haven't been able to find a straightforward solution by searching here.
I suspect you don't just want the most common phrases, but rather you want the most interesting collocations. Otherwise, you could end up with an overrepresentation of phrases made up of common words and fewer interesting and informative phrases.
To do this, you'll essentially want to extract n-grams from your data and then find the ones that have the highest pointwise mutual information (PMI). That is, you want to find the words that co-occur much more often than you would expect them to by chance.
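To make the PMI idea concrete, here is a minimal pure-Python sketch for bigrams. The function name `bigram_pmi`, the `min_count` cutoff, and the toy sentence are my own illustrative choices, not part of NLTK. Probabilities are estimated from raw counts, which is the simplest possible estimate.

```python
import math
from collections import Counter


def bigram_pmi(tokens, min_count=2):
    """Score each adjacent word pair by pointwise mutual information.

    PMI(x, y) = log2( P(x, y) / (P(x) * P(y)) ), with probabilities
    estimated from raw counts over the token list.
    """
    word_counts = Counter(tokens)
    pair_counts = Counter(zip(tokens, tokens[1:]))
    n_words = len(tokens)
    n_pairs = max(n_words - 1, 1)

    scores = {}
    for (x, y), count in pair_counts.items():
        if count < min_count:
            continue  # rare pairs tend to get inflated PMI, so drop them
        p_xy = count / n_pairs
        p_x = word_counts[x] / n_words
        p_y = word_counts[y] / n_words
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores


# Toy data, invented for illustration only
tokens = ("try the hamburger . the fries are fine . "
          "try the hamburger . the service is fine").split()
scores = bigram_pmi(tokens)
for pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 2))
```

A positive score means the pair shows up together more often than its individual word frequencies would predict. The frequency cutoff matters because a pair seen only once can score very high purely by accident.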
The NLTK collocations how-to covers how to do this in about 7 lines of code, e.g.:
```python
import nltk
from nltk.collocations import *

bigram_measures = nltk.collocations.BigramAssocMeasures()
trigram_measures = nltk.collocations.TrigramAssocMeasures()

# change this to read in your data
finder = BigramCollocationFinder.from_words(
    nltk.corpus.genesis.words('english-web.txt'))

# only bigrams that appear 3+ times
finder.apply_freq_filter(3)

# return the 10 n-grams with the highest PMI
finder.nbest(bigram_measures.pmi, 10)
```
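For review text pulled from a database, the same finder can be fed your own tokens instead of the Genesis corpus. The following is only a sketch under some assumptions: `reviews_html` is a hypothetical stand-in for your MySQL rows, the tag stripping is a crude regex rather than a real HTML parser, and the stopword set is hand-picked for the example.

```python
import re

from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# Hypothetical stand-in for rows fetched from the MySQL table
reviews_html = [
    "<p>Try the <b>sashimi lunch</b> special.</p>",
    "<p>The sashimi lunch is worth the wait.</p>",
    "<div>Sashimi lunch again, and the wait was short.</div>",
]

# A small hand-picked stopword set; NLTK's stopwords corpus would also work
STOPWORDS = {"the", "a", "an", "and", "is", "was", "of", "to", "in"}


def tokenize(html):
    """Crudely strip tags, lowercase, and split into word tokens."""
    text = re.sub(r"<[^>]+>", " ", html)
    return re.findall(r"[a-z']+", text.lower())


# from_documents keeps bigrams from spanning two separate reviews
documents = [tokenize(review) for review in reviews_html]
finder = BigramCollocationFinder.from_documents(documents)

# keep bigrams seen at least twice, and drop any that contain a stopword
finder.apply_freq_filter(2)
finder.apply_word_filter(lambda word: word in STOPWORDS)

bigram_measures = BigramAssocMeasures()
top_bigrams = finder.nbest(bigram_measures.pmi, 5)
print(top_bigrams)
```

On real data you would probably raise the frequency threshold and swap the regex for a proper HTML parser. To get counts like "in 44 reviews", `finder.ngram_fd` holds the raw frequency of each surviving bigram.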
I think what you're looking for is chunking. I recommend reading chapter 7 of the NLTK book or maybe my own article on chunk extraction. Both of these assume knowledge of part-of-speech tagging, which is covered in chapter 5.
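As a rough idea of what chunking looks like, here is a small sketch using `nltk.RegexpParser` to pull out noun phrases. The sentence is hand-tagged with Penn Treebank tags so the example needs no tagger model, and the grammar is a simple assumption of mine rather than the one from the book or the article.

```python
import nltk

# A hand-tagged sentence (Penn Treebank tags), so no tagger model is needed;
# in practice a part-of-speech tagger would produce these (word, tag) pairs
tagged = [
    ("try", "VB"), ("the", "DT"), ("spicy", "JJ"), ("tuna", "NN"),
    ("roll", "NN"), ("at", "IN"), ("the", "DT"), ("sushi", "NN"), ("bar", "NN"),
]

# NP = optional determiner, any number of adjectives, then one or more nouns
grammar = "NP: {<DT>?<JJ>*<NN.*>+}"
chunker = nltk.RegexpParser(grammar)
tree = chunker.parse(tagged)

# Collect the words under each NP subtree as a phrase string
noun_phrases = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in tree.subtrees(filter=lambda t: t.label() == "NP")
]
print(noun_phrases)
```

Counting these phrases across all reviews (for example with `collections.Counter`) would give candidates for highlight snippets.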