 

How to filter word permutations to only find semantically correct ngrams? (Python 3, NLTK)

I want to create a number of permutations from a list of 200 words. This obviously creates a problem, because it leads to a ridiculously gigantic number of possible permutations (up to 5 words in a phrase). To keep this number manageable I have a two-pronged attack:

  1. Pass the words through a POS filter so that only linguistically sound phrases are created (a sketch of this step follows the list), and
  2. filter down to those permutations that are actual ngrams, i.e. ones with a high PMI / likelihood score.
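
For step 1, I have something like the sketch below in mind. The whitelist of allowed tag sequences is made up purely for illustration, and tagging isolated words with `nltk.pos_tag` is only a rough approximation, since the tagger is designed for full sentences:

```python
import itertools
import nltk  # requires: nltk.download('averaged_perceptron_tagger')

words = ["ball", "bat", "pinch", "home", "run", "base", "hitter", "pitcher", "call"]

# Tag each word once, in isolation -- a rough approximation, since the
# tagger normally sees full sentences.
pos = {w: nltk.pos_tag([w])[0][1] for w in words}

# Hypothetical whitelist of tag sequences considered "linguistically sound".
ALLOWED = {("NN", "NN"), ("JJ", "NN"), ("VB", "NN"), ("NN", "NN", "NN")}

candidates = [
    p
    for n in range(2, 6)  # phrases of 2 to 5 words
    for p in itertools.permutations(words, n)
    if tuple(pos[w] for w in p) in ALLOWED
]
```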

The second part of this idea has me wondering: I know that NLTK offers the ability to find ngrams, but every example I have seen analyzes a corpus, which makes sense because a frequency distribution is needed. However, is it possible to find the PMI of a word permutation?

Would it be possible to find the PMI score of my word permutations based on the common collocations found within a custom corpus? Can it be done manually?

For example, while the permutation "the verbose tea" is linguistically sound, it is not a contextually good permutation.

I know the code to find common collocations within a block of text or corpus, but this is a fairly unusual problem, so I was hoping someone could give me some advice. At the very least, help me wrap my head around this!
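
Here is as far as I have gotten computing a bigram's PMI by hand from corpus counts; the `webtext` corpus below is just a stand-in for whatever custom corpus I end up building:

```python
import math
from nltk import FreqDist, bigrams
from nltk.corpus import webtext  # stand-in corpus; requires nltk.download('webtext')

tokens = [t.lower() for t in webtext.words()]
unigram_fd = FreqDist(tokens)
bigram_fd = FreqDist(bigrams(tokens))
n = len(tokens)

def pmi(w1, w2):
    """PMI = log2( p(w1, w2) / (p(w1) * p(w2)) ); None if the pair never occurs."""
    joint = bigram_fd[(w1, w2)]
    if joint == 0:
        return None
    return math.log2((joint / n) / ((unigram_fd[w1] / n) * (unigram_fd[w2] / n)))
```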

Example

**KW**
 Ball
 Bat
 Pinch
 Home
 Run
 Base
 Hitter
 Pitcher
 Call
 etc...

MORE BACKGROUND: Now, there are a number of permutations that can be made from this list, but only a handful that would actually make sense. Passing this list through a POS filter lets me create keywords that make linguistic sense, but not ones that are semantically correct: it would still allow something like "Call Ball Hitter". This is my struggle: to somehow create semantically correct permutations based on some sort of scoring criterion like PMI.

My idea was to scrape a website, e.g. http://en.wikipedia.org/wiki/Baseball, find the common ngrams within it, and then judge the relative semantic strength of a keyword permutation against that corpus (a scoring sketch follows below). But I am struggling to conceptualize this and am unsure if it is even possible. Really, though, I would love to hear any other ideas about how to efficiently find ngram permutations! The exercise boils down to efficiently eliminating nonsensical permutations without having to manually categorize/score everything.
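
To illustrate the scoring idea: building on the `pmi` helper and `candidates` list sketched above, each candidate permutation could be ranked by the average PMI of its adjacent word pairs over the scraped corpus, with unseen pairs penalized heavily. Purely illustrative:

```python
def phrase_score(phrase, floor=-10.0):
    """Average PMI over adjacent word pairs; unseen pairs score `floor`."""
    scores = []
    for w1, w2 in zip(phrase, phrase[1:]):
        s = pmi(w1, w2)
        scores.append(s if s is not None else floor)
    return sum(scores) / len(scores)

# Permutations whose adjacent pairs actually co-occur rise to the top.
ranked = sorted(candidates, key=phrase_score, reverse=True)
```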

asked Sep 05 '14 by user3682157

2 Answers

Just thinking out loud here: Google has made public the list of all [1,2,3,4,5]-grams from the Google Books Ngram Viewer corpus that appeared more than 40 times, along with their frequency counts. So you could take each ngram that you generate and look up its frequency in the Google ngram dataset. Ngrams with a higher count are more likely to be semantically sound.

... The downside is that Google's entire ngram dataset is something like 1 TB to download, and I don't know if they have an API for it.
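
If you did grab a slice of the dataset, aggregating a lookup table is straightforward. If I remember right, the raw files are gzipped TSV with one row per ngram per year (ngram, year, match_count, volume_count); the filename and format below are from memory, so double-check against the download page:

```python
import gzip
from collections import Counter

counts = Counter()
# One shard of the 2-gram dataset; the filename is illustrative.
with gzip.open("googlebooks-eng-all-2gram-20120701-ba.gz", "rt", encoding="utf-8") as f:
    for line in f:
        ngram, year, match_count, volume_count = line.rstrip("\n").split("\t")
        counts[ngram.lower()] += int(match_count)  # sum over all years

print(counts["home run"])  # higher count => more plausible phrase
```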

EDIT:

I would be shocked if there wasn't an API for this. Also, Google doesn't seem to be the only game in town; a quick search turned up:

  • Microsoft Web N-gram Services
  • www.ngrams.info
  • www.wordfrequency.info
answered Oct 19 '22 by Mike Ounsworth


I figured out my own answer, with what I think is a pretty nifty solution! It is based on this article: http://research.microsoft.com/en-us/um/people/jfgao/paper/webngram.sigirws.v2.pdf. The idea here is NOT to create a bunch of random garbage permutations and then sift through them to find the semantically correct ones. The idea is to ONLY create semantically correct permutations in the first place. This can be done by building phrases in stages according to the basic n-1 principle: the idea that a word is only semantically dependent on the preceding word.

So the plan is to find all the bigrams within a relevant corpus, along with their frequencies. The higher a bigram's frequency, the more likely that expression is semantically correct. So say you have a list like this, where each bigram appears 10 times in the corpus:

The man
a plan
in Panama
Panama City
Man Who
Who is
is awesome

From there you construct phrases in stages, according to n-1. Take a starting keyword from your original list. Then find a bigram in your corpus list whose first word matches it, and append that bigram. So, for example, take the word 'THE' from your original list; after looking through the corpus above, you should now have the phrase 'THE THE MAN'. Rinse and repeat with that phrase: following the n-1 principle, find a bigram that starts with 'man'. You now have 'THE THE MAN MAN WHO'. Rinse and repeat! This should create phrases whose words are in a semantically sensible order (obviously you remove the duplicated join words from each phrase at the end; a sketch follows below).
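
A rough sketch of the staged construction. The bigram counts and frequency cutoff are placeholders, and this version appends only the second word of each matching bigram, which amounts to doing the duplicate removal inline rather than at the end:

```python
from collections import defaultdict

# Bigram frequencies from your corpus (placeholder numbers).
bigram_counts = {
    ("the", "man"): 10, ("man", "who"): 10, ("who", "is"): 10,
    ("is", "awesome"): 10, ("a", "plan"): 10,
    ("in", "panama"): 10, ("panama", "city"): 10,
}
MIN_FREQ = 5  # illustrative cutoff

# Index: word -> words that follow it often enough in the corpus.
follows = defaultdict(list)
for (w1, w2), count in bigram_counts.items():
    if count >= MIN_FREQ:
        follows[w1].append(w2)

def grow(phrase, max_len=5):
    """Extend a phrase with bigrams whose first word matches the
    phrase's last word (the n-1 principle)."""
    yield phrase
    if len(phrase) < max_len:
        for nxt in follows[phrase[-1]]:
            yield from grow(phrase + (nxt,), max_len)

for p in grow(("the",)):
    print(" ".join(p))
# the / the man / the man who / the man who is / the man who is awesome
```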

What do you guys think?

answered Oct 19 '22 by user3682157