Background:
I am trying to compare pairs of words to see which pair is "more likely to occur" in US English than another pair. My plan is/was to use the collocation facilities in NLTK to score word pairs, with the higher scoring pair being the most likely.
Approach:
I coded the following in Python using NLTK (several steps and imports removed for brevity):
bgm = nltk.collocations.BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)
scored = finder.score_ngrams(bgm.likelihood_ratio)
print(scored)
Results:
I then examined the results using two word pairs, one of which should be highly likely to co-occur and one which should not ("roasted cashews" and "gasoline cashews"). I was surprised to see these word pairings score identically:
[(('roasted', 'cashews'), 5.545177444479562)]
[(('gasoline', 'cashews'), 5.545177444479562)]
I would have expected 'roasted cashews' to score higher than 'gasoline cashews' in my test.
Questions:
Why do these two word pairs score identically? Is there an error in my approach, or in my understanding of how the collocation scoring works?
Thank you very much for any information or help!
You will mostly be interested in nltk.collocations.BigramCollocationFinder; a quick demonstration shows how to get started.
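A minimal sketch along those lines, assuming raw text split with nltk.sent_tokenize and nltk.word_tokenize (the tokenize helper and the sample text are illustrative):

import nltk

def tokenize(text):
    # Split raw text into sentences, then into lowercase word tokens.
    for sent in nltk.sent_tokenize(text):
        for word in nltk.word_tokenize(sent):
            yield word.lower()

text = "I like roasted cashews. Roasted cashews make a tasty snack."
finder = nltk.collocations.BigramCollocationFinder.from_words(tokenize(text))
bgm = nltk.collocations.BigramAssocMeasures()
print(finder.score_ngrams(bgm.likelihood_ratio))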
Collocations are phrases or expressions containing multiple words that are highly likely to co-occur. For example: 'social media', 'school holiday', 'machine learning', 'Universal Studios Singapore', etc.
The Pointwise Mutual Information (PMI) score for bigrams is:

PMI(w1, w2) = log[ P(w1, w2) / (P(w1) P(w2)) ]

For trigrams:

PMI(w1, w2, w3) = log[ P(w1, w2, w3) / (P(w1) P(w2) P(w3)) ]

The main intuition is that it measures how much more likely the words are to co-occur than if they were independent. However, it is very sensitive to rare combinations of words.
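One common way to blunt that sensitivity in NLTK is to drop rare bigrams with a frequency filter before ranking by PMI; a minimal sketch using the Brown corpus (the cutoff of 5 is an arbitrary choice):

import nltk.collocations
import nltk.corpus

bgm = nltk.collocations.BigramAssocMeasures()
finder = nltk.collocations.BigramCollocationFinder.from_words(
    nltk.corpus.brown.words())
# Ignore bigrams occurring fewer than 5 times, since PMI overweights rare pairs.
finder.apply_freq_filter(5)
# The ten highest-PMI bigrams that survive the filter.
print(finder.nbest(bgm.pmi, 10))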
In natural language processing, an n-gram is a contiguous sequence of n words. For example, "Python" is a unigram (n = 1), "data science" is a bigram (n = 2), "natural language processing" is a trigram (n = 3), and so on. The question here concerns bigrams (pairs of words).
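As a quick illustration, NLTK's ngrams helper turns a token sequence into n-grams of any order (the sample tokens are made up):

import nltk

tokens = "machine learning is fun".split()
print(list(nltk.ngrams(tokens, 2)))  # [('machine', 'learning'), ('learning', 'is'), ('is', 'fun')]
print(list(nltk.ngrams(tokens, 3)))  # [('machine', 'learning', 'is'), ('learning', 'is', 'fun')]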
The NLTK collocations documentation seems pretty good to me: http://www.nltk.org/howto/collocations.html
You need to give the scorer an actual sizable corpus to work with; if you score each pair in isolation, every word and bigram occurs exactly once, so the statistics for the two pairs come out identical. Here is a working example using the Brown corpus built into NLTK. It takes about 30 seconds to run.
import nltk.collocations
import nltk.corpus
import collections

bgm = nltk.collocations.BigramAssocMeasures()
finder = nltk.collocations.BigramCollocationFinder.from_words(
    nltk.corpus.brown.words())
scored = finder.score_ngrams(bgm.likelihood_ratio)

# Group bigrams by first word in bigram.
prefix_keys = collections.defaultdict(list)
for key, scores in scored:
    prefix_keys[key[0]].append((key[1], scores))

# Sort keyed bigrams by strongest association.
for key in prefix_keys:
    prefix_keys[key].sort(key=lambda x: -x[1])

print('doctor', prefix_keys['doctor'][:5])
print('baseball', prefix_keys['baseball'][:5])
print('happy', prefix_keys['happy'][:5])
The output seems reasonable: it works well for 'baseball', less so for 'doctor' and 'happy'.
doctor [('bills', 35.061321987405748), (',', 22.963930079491501), ('annoys', 19.009636692022365), ('had', 16.730384189212423), ('retorted', 15.190847940499127)]
baseball [('game', 32.110754519752291), ('cap', 27.81891372457088), ('park', 23.509042621473505), ('games', 23.105033513054011), ("player's", 16.227872863424668)]
happy [("''", 20.296341424483998), ('Spahn', 13.915820697905589), ('family', 13.734352182441569), (',', 13.55077617193821), ('bodybuilder', 13.513265447290536)]
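To connect this back to the original comparison, one could turn scored into a dictionary and look both pairs up against the corpus-wide statistics; a minimal sketch (either pair may simply be absent from Brown, hence the default of 0):

# Look up the two pairs from the question in the corpus-wide scores.
pair_scores = dict(scored)
print(pair_scores.get(('roasted', 'cashews'), 0))
print(pair_scores.get(('gasoline', 'cashews'), 0))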