Background:
I am trying to compare pairs of words to see which pair is "more likely to occur" in US English than another pair. My plan is/was to use the collocation facilities in NLTK to score word pairs, with the higher scoring pair being the most likely.
Approach:
I coded the following in Python using NLTK (several steps and imports removed for brevity):
bgm = nltk.collocations.BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)
scored = finder.score_ngrams(bgm.likelihood_ratio)
print(scored)
Results:
I then examined the results using two word pairs, one of which should be highly likely to co-occur and one which should not ("roasted cashews" and "gasoline cashews"). I was surprised to see these word pairings score identically:
[(('roasted', 'cashews'), 5.545177444479562)]
[(('gasoline', 'cashews'), 5.545177444479562)]
I would have expected 'roasted cashews' to score higher than 'gasoline cashews' in my test.
Questions:
Why do these two word pairs score identically? Is there an error in my approach, or in my understanding of how the collocation scoring works?
Thank you very much for any information or help!
You will mostly be interested in nltk.collocations.BigramCollocationFinder; a quick demonstration shows how to get started.
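A minimal sketch along those lines, assuming raw text split with nltk.sent_tokenize and nltk.word_tokenize (the tokenize helper and the sample text are illustrative):

import nltk

def tokenize(text):
    # Split raw text into sentences, then into lowercase word tokens.
    for sent in nltk.sent_tokenize(text):
        for word in nltk.word_tokenize(sent):
            yield word.lower()

text = "I like roasted cashews. Roasted cashews make a tasty snack."
finder = nltk.collocations.BigramCollocationFinder.from_words(tokenize(text))
bgm = nltk.collocations.BigramAssocMeasures()
print(finder.score_ngrams(bgm.likelihood_ratio))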
Collocations are phrases or expressions containing multiple words that are highly likely to co-occur. For example: 'social media', 'school holiday', 'machine learning', 'Universal Studios Singapore', etc.
The Pointwise Mutual Information (PMI) score for bigrams is:

PMI(w1, w2) = log[ P(w1, w2) / (P(w1) P(w2)) ]

For trigrams:

PMI(w1, w2, w3) = log[ P(w1, w2, w3) / (P(w1) P(w2) P(w3)) ]

The main intuition is that it measures how much more likely the words are to co-occur than if they were independent. However, it is very sensitive to rare combinations of words.
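One common way to blunt that sensitivity in NLTK is to drop rare bigrams with a frequency filter before ranking by PMI; a minimal sketch using the Brown corpus (the cutoff of 5 is an arbitrary choice):

import nltk.collocations
import nltk.corpus

bgm = nltk.collocations.BigramAssocMeasures()
finder = nltk.collocations.BigramCollocationFinder.from_words(
    nltk.corpus.brown.words())
# Ignore bigrams occurring fewer than 5 times, since PMI overweights rare pairs.
finder.apply_freq_filter(5)
# The ten highest-PMI bigrams that survive the filter.
print(finder.nbest(bgm.pmi, 10))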
In natural language processing, an n-gram is a contiguous sequence of n words. For example, "Python" is a unigram (n = 1), "data science" is a bigram (n = 2), "natural language processing" is a trigram (n = 3), and so on. The question here concerns bigrams (pairs of words).
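As a quick illustration, NLTK's ngrams helper turns a token sequence into n-grams of any order (the sample tokens are made up):

import nltk

tokens = "machine learning is fun".split()
print(list(nltk.ngrams(tokens, 2)))  # [('machine', 'learning'), ('learning', 'is'), ('is', 'fun')]
print(list(nltk.ngrams(tokens, 3)))  # [('machine', 'learning', 'is'), ('learning', 'is', 'fun')]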
The NLTK collocations documentation seems pretty good to me: http://www.nltk.org/howto/collocations.html
You need to give the scorer an actual sizable corpus to work with; if you score each pair in isolation, every word and bigram occurs exactly once, so the statistics for the two pairs come out identical. Here is a working example using the Brown corpus built into NLTK. It takes about 30 seconds to run.
import nltk.collocations
import nltk.corpus
import collections

bgm = nltk.collocations.BigramAssocMeasures()
finder = nltk.collocations.BigramCollocationFinder.from_words(
    nltk.corpus.brown.words())
scored = finder.score_ngrams(bgm.likelihood_ratio)

# Group bigrams by first word in bigram.
prefix_keys = collections.defaultdict(list)
for key, scores in scored:
    prefix_keys[key[0]].append((key[1], scores))

# Sort keyed bigrams by strongest association.
for key in prefix_keys:
    prefix_keys[key].sort(key=lambda x: -x[1])

print('doctor', prefix_keys['doctor'][:5])
print('baseball', prefix_keys['baseball'][:5])
print('happy', prefix_keys['happy'][:5])
The output seems reasonable: it works well for 'baseball', less so for 'doctor' and 'happy'.
doctor [('bills', 35.061321987405748), (',', 22.963930079491501), ('annoys', 19.009636692022365), ('had', 16.730384189212423), ('retorted', 15.190847940499127)]
baseball [('game', 32.110754519752291), ('cap', 27.81891372457088), ('park', 23.509042621473505), ('games', 23.105033513054011), ("player's", 16.227872863424668)]
happy [("''", 20.296341424483998), ('Spahn', 13.915820697905589), ('family', 13.734352182441569), (',', 13.55077617193821), ('bodybuilder', 13.513265447290536)]
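To connect this back to the original comparison, one could turn scored into a dictionary and look both pairs up against the corpus-wide statistics; a minimal sketch (either pair may simply be absent from Brown, hence the default of 0):

# Look up the two pairs from the question in the corpus-wide scores.
pair_scores = dict(scored)
print(pair_scores.get(('roasted', 'cashews'), 0))
print(pair_scores.get(('gasoline', 'cashews'), 0))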