 

Understanding NLTK collocation scoring for bigrams and trigrams

Tags: python, nlp, nltk

Background:

I am trying to compare pairs of words to see which pair is "more likely to occur" in US English than another pair. My plan is to use the collocation facilities in NLTK to score word pairs, with the higher-scoring pair being the more likely.

Approach:

I coded the following in Python using NLTK (tokenization and several other steps removed for brevity):

    import nltk
    from nltk.collocations import BigramCollocationFinder

    # 'tokens' comes from the elided tokenization steps.
    bgm = nltk.collocations.BigramAssocMeasures()
    finder = BigramCollocationFinder.from_words(tokens)
    scored = finder.score_ngrams(bgm.likelihood_ratio)
    print(scored)

Results:

I then examined the results using two word pairs, one of which should be highly likely to co-occur and one which should not ("roasted cashews" and "gasoline cashews"). I was surprised to see these word pairings score identically:

    [(('roasted', 'cashews'), 5.545177444479562)]
    [(('gasoline', 'cashews'), 5.545177444479562)]

I would have expected 'roasted cashews' to score higher than 'gasoline cashews' in my test.
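
A minimal sketch that reproduces the effect, assuming each pair is tokenized and scored in isolation (the original tokenization was elided, so that is an assumption; the exact number may differ across NLTK versions, but the two pairs always tie):

    import nltk
    from nltk.collocations import BigramCollocationFinder

    bgm = nltk.collocations.BigramAssocMeasures()
    for pair in (['roasted', 'cashews'], ['gasoline', 'cashews']):
        finder = BigramCollocationFinder.from_words(pair)
        # Each finder sees exactly one bigram, built from one occurrence
        # of each word, so both pairs are scored on identical counts.
        print(finder.score_ngrams(bgm.likelihood_ratio))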

Questions:

  1. Am I misunderstanding the use of collocations?
  2. Is my code incorrect?
  3. Is my assumption that the scores should be different wrong, and if so why?

Thank you very much for any information or help!

ccgillett asked Dec 30 '11

People also ask

How do you find collocations in NLTK?

You will mostly be interested in nltk.collocations.BigramCollocationFinder; a quick demonstration of how to get started follows below.
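
A minimal quick-start sketch, using a made-up sentence and simple whitespace tokenization as stand-ins for real input:

    import nltk.collocations
    from nltk.collocations import BigramCollocationFinder

    # Made-up toy input; any list of tokens works here.
    text = "he drank a cup of strong coffee and she drank a cup of strong tea"
    tokens = text.split()

    bgm = nltk.collocations.BigramAssocMeasures()
    finder = BigramCollocationFinder.from_words(tokens)
    # The five bigrams with the strongest likelihood-ratio association.
    print(finder.nbest(bgm.likelihood_ratio, 5))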

What is a collocation in NLP?

Collocations are phrases or expressions containing multiple words that are highly likely to co-occur. For example: 'social media', 'school holiday', 'machine learning', 'Universal Studios Singapore', etc.

What is PMI in bigram?

The Pointwise Mutual Information (PMI) score for bigrams is:

    PMI(w1, w2) = log2( P(w1, w2) / (P(w1) * P(w2)) )

and for trigrams:

    PMI(w1, w2, w3) = log2( P(w1, w2, w3) / (P(w1) * P(w2) * P(w3)) )

The main intuition is that it measures how much more likely the words are to co-occur than if they were independent. However, it is very sensitive to rare combinations of words.
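
A sketch of ranking bigrams by PMI in NLTK, using a frequency filter to blunt that sensitivity to rare combinations (the threshold of 3 is an arbitrary choice, and the Brown corpus must be installed, e.g. via nltk.download('brown')):

    import nltk.collocations
    import nltk.corpus

    bgm = nltk.collocations.BigramAssocMeasures()
    finder = nltk.collocations.BigramCollocationFinder.from_words(
        nltk.corpus.brown.words())
    finder.apply_freq_filter(3)  # ignore bigrams seen fewer than 3 times
    # Ten bigrams with the highest PMI after filtering.
    print(finder.nbest(bgm.pmi, 10))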

What is Unigrams and bigrams in python?

In natural language processing, an n-gram is a sequence of n words. For example, "Python" is a unigram (n = 1), "Data Science" is a bigram (n = 2), "Natural language processing" is a trigram (n = 3), etc. Here our focus will be on implementing the unigram (single-word) model in Python.
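
A short sketch of producing unigrams, bigrams, and trigrams with nltk.ngrams (the sample sentence is a made-up example):

    import nltk

    tokens = "the quick brown fox jumps over the lazy dog".split()

    unigrams = list(nltk.ngrams(tokens, 1))  # [('the',), ('quick',), ...]
    bigrams = list(nltk.ngrams(tokens, 2))   # [('the', 'quick'), ...]
    trigrams = list(nltk.ngrams(tokens, 3))  # [('the', 'quick', 'brown'), ...]
    print(unigrams)
    print(bigrams)
    print(trigrams)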


1 Answer

The NLTK collocations documentation seems pretty good to me: http://www.nltk.org/howto/collocations.html
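
One way to see why the two pairs in the question tie (a sketch, assuming each pair was scored on its own two-word input): the finders end up holding identical frequency counts, and every association measure is a function of those counts alone.

    import nltk
    from nltk.collocations import BigramCollocationFinder

    for pair in (['roasted', 'cashews'], ['gasoline', 'cashews']):
        finder = BigramCollocationFinder.from_words(pair)
        # Both print the same counts: one occurrence of the bigram,
        # one occurrence of each word.
        print(dict(finder.ngram_fd), dict(finder.word_fd))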

You need to give the scorer some actual sizable corpus to work with. Here is a working example using the Brown corpus built into NLTK. It takes about 30 seconds to run.

    import nltk.collocations
    import nltk.corpus
    import collections

    bgm = nltk.collocations.BigramAssocMeasures()
    finder = nltk.collocations.BigramCollocationFinder.from_words(
        nltk.corpus.brown.words())
    scored = finder.score_ngrams(bgm.likelihood_ratio)

    # Group bigrams by first word in bigram.
    prefix_keys = collections.defaultdict(list)
    for key, scores in scored:
        prefix_keys[key[0]].append((key[1], scores))

    # Sort keyed bigrams by strongest association.
    for key in prefix_keys:
        prefix_keys[key].sort(key=lambda x: -x[1])

    print('doctor', prefix_keys['doctor'][:5])
    print('baseball', prefix_keys['baseball'][:5])
    print('happy', prefix_keys['happy'][:5])

The output seems reasonable: it works well for baseball, less so for doctor and happy.

    doctor [('bills', 35.061321987405748), (',', 22.963930079491501),
     ('annoys', 19.009636692022365),
     ('had', 16.730384189212423), ('retorted', 15.190847940499127)]

    baseball [('game', 32.110754519752291), ('cap', 27.81891372457088),
     ('park', 23.509042621473505), ('games', 23.105033513054011),
     ("player's", 16.227872863424668)]

    happy [("''", 20.296341424483998), ('Spahn', 13.915820697905589),
     ('family', 13.734352182441569),
     (',', 13.55077617193821), ('bodybuilder', 13.513265447290536)]
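
The question's title also mentions trigrams; the same pattern carries over with the trigram classes. A minimal sketch, again on the Brown corpus:

    import nltk.collocations
    import nltk.corpus

    tgm = nltk.collocations.TrigramAssocMeasures()
    finder = nltk.collocations.TrigramCollocationFinder.from_words(
        nltk.corpus.brown.words())
    # Ten trigrams with the strongest likelihood-ratio association.
    print(finder.nbest(tgm.likelihood_ratio, 10))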
Rob Neuhaus answered Sep 19 '22