How to get n-gram collocations and association in python nltk?

Tags: python, nlp, nltk, n-gram, collocation

In this documentation, there is an example using nltk.collocations.BigramAssocMeasures(), BigramCollocationFinder, nltk.collocations.TrigramAssocMeasures(), and TrigramCollocationFinder.

It shows how to find the n best collocations by PMI for bigrams and trigrams. For example:

>>> finder = BigramCollocationFinder.from_words(
...     nltk.corpus.genesis.words('english-web.txt'))
>>> finder.nbest(bigram_measures.pmi, 10)

I know that BigramCollocationFinder and TrigramCollocationFinder inherit from AbstractCollocationFinder, while BigramAssocMeasures() and TrigramAssocMeasures() inherit from NgramAssocMeasures.

How can I use the methods (e.g. nbest()) of AbstractCollocationFinder and NgramAssocMeasures for 4-grams, 5-grams, 6-grams, ..., n-grams, as easily as for bigrams and trigrams?

Should I create a class that inherits from AbstractCollocationFinder?

Thanks.

asked Sep 07 '13 09:09

Fahmi Rizal

2 Answers

If you want n-grams longer than 2 or 3 words, you can use scikit-learn's CountVectorizer to generate them and NLTK's FreqDist to count them. I tried doing this with nltk.collocations, but I don't think it can score anything beyond 3-grams, so I decided to work with n-gram counts instead. I hope this helps a little.

Here is the code:

from sklearn.feature_extraction.text import CountVectorizer
import nltk

query = "This document gives a very short introduction to machine learning problems"
# Build an analyzer that emits every 1-gram to 4-gram of the input text.
vect = CountVectorizer(ngram_range=(1, 4))
analyzer = vect.build_analyzer()
listNgramQuery = analyzer(query)
listNgramQuery.reverse()
print("listNgramQuery=", listNgramQuery)
# Count how often each n-gram occurs.
NgramQueryWeights = nltk.FreqDist(listNgramQuery)
print("\nNgramQueryWeights=", NgramQueryWeights)

The original run (under Python 2, hence the u'' prefixes) gave this output:

listNgramQuery= [u'to machine learning problems', u'introduction to machine learning', u'short introduction to machine', u'very short introduction to', u'gives very short introduction', u'document gives very short', u'this document gives very', u'machine learning problems', u'to machine learning', u'introduction to machine', u'short introduction to', u'very short introduction', u'gives very short', u'document gives very', u'this document gives', u'learning problems', u'machine learning', u'to machine', u'introduction to', u'short introduction', u'very short', u'gives very', u'document gives', u'this document', u'problems', u'learning', u'machine', u'to', u'introduction', u'short', u'very', u'gives', u'document', u'this']

NgramQueryWeights= <FreqDist: u'document': 1, u'document gives': 1, u'document gives very': 1, u'document gives very short': 1, u'gives': 1, u'gives very': 1, u'gives very short': 1, u'gives very short introduction': 1, u'introduction': 1, u'introduction to': 1, ...>
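The same kind of count can probably be obtained with NLTK alone, without scikit-learn. Here is a minimal sketch, assuming a whitespace-tokenized toy sentence (the sentence and the variable names are made up for illustration). nltk.util.ngrams yields tuples of n consecutive tokens, and FreqDist counts them:

```python
from nltk import FreqDist
from nltk.util import ngrams

# Toy input, chosen so that one 4-gram repeats (illustrative only).
tokens = "the quick brown fox and the quick brown fox again".split()

# Count every run of 4 consecutive tokens.
fourgram_counts = FreqDist(ngrams(tokens, 4))

# Most frequent 4-gram together with its count.
top_fourgram = fourgram_counts.most_common(1)
print(top_fourgram)
```

Changing the 4 to any other n gives counts for that order, but these are raw counts only, not association scores.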

answered Oct 27 '22 14:10

Gunjan

Edited

The current NLTK has a hardcoded class for up to 4-grams (QuadgramCollocationFinder), but the reason why you cannot simply create an NgramCollocationFinder still stands: you would have to radically change the formulas in from_words() for each order of n-gram.
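For 4-grams specifically, the built-in classes can be used much like the bigram ones. A small sketch, assuming an NLTK 3.x install that ships QuadgramCollocationFinder and QuadgramAssocMeasures; the toy word list is made up so that it runs without downloading a corpus:

```python
from nltk.collocations import QuadgramCollocationFinder
from nltk.metrics.association import QuadgramAssocMeasures

# Toy token stream (illustrative only); a corpus reader's words() would also work.
words = "to be or not to be or not to be or not".split()

quadgram_measures = QuadgramAssocMeasures()
finder = QuadgramCollocationFinder.from_words(words)

# Rank 4-grams by raw frequency and keep the best one.
best = finder.nbest(quadgram_measures.raw_freq, 1)
print(best)
```

Other measures such as quadgram_measures.pmi are passed to nbest() in the same way. This still stops at n = 4, so it does not solve the general n-gram case.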

Short answer: no, you cannot simply create an AbstractCollocationFinder (ACF) and call nbest() on it if you want to find collocations beyond 2- and 3-grams.

That is because from_words() differs for each n-gram order. Only the subclasses of ACF (i.e. BigramCF and TrigramCF) have a from_words() function:

>>> finder = BigramCollocationFinder.from_words(nltk.corpus.genesis.words('english-web.txt'))
>>> finder = AbstractCollocationFinder.from_words(nltk.corpus.genesis.words('english-web.txt',5))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: type object 'AbstractCollocationFinder' has no attribute 'from_words'

So given this from_words() in TrigramCF:

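from nltk.probability import FreqDist
from nltk.util import ngrams  # replaces the older ingrams()

@classmethod
def from_words(cls, words):
    # One FreqDist per table; (FreqDist(),)*4 would alias a single object.
    wfd, wildfd, bfd, tfd = FreqDist(), FreqDist(), FreqDist(), FreqDist()

    for w1, w2, w3 in ngrams(words, 3, pad_right=True):
        wfd[w1] += 1  # FreqDist.inc() was removed in NLTK 3

        if w2 is None:
            continue
        bfd[(w1, w2)] += 1

        if w3 is None:
            continue
        wildfd[(w1, w3)] += 1
        tfd[(w1, w2, w3)] += 1

    return cls(wfd, bfd, wildfd, tfd)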

You could hack it and hardcode a 4-gram association finder like this:

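from nltk.probability import FreqDist
from nltk.util import ngrams

@classmethod
def from_words(cls, words):
    # Schematic only: the wildcard bookkeeping is kept as in the original
    # hack and is not a verified contingency table for 4-grams.
    wfd, wildfd = FreqDist(), FreqDist()
    bfd, tfd, ffd = FreqDist(), FreqDist(), FreqDist()

    # A 4-gram finder needs windows of 4 words, not 5.
    for w1, w2, w3, w4 in ngrams(words, 4, pad_right=True):
        wfd[w1] += 1

        if w2 is None:
            continue
        bfd[(w1, w2)] += 1

        if w3 is None:
            continue
        wildfd[(w1, w3)] += 1
        tfd[(w1, w2, w3)] += 1

        if w4 is None:
            continue
        wildfd[(w1, w4)] += 1
        wildfd[(w2, w4)] += 1
        wildfd[(w3, w4)] += 1
        wildfd[(w1, w3)] += 1
        wildfd[(w2, w3)] += 1
        wildfd[(w1, w2)] += 1
        ffd[(w1, w2, w3, w4)] += 1

    return cls(wfd, bfd, wildfd, tfd, ffd)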

You would then also have to change every part of the code that uses the cls instance returned by from_words() accordingly.
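If all you need is a PMI-style ranking for arbitrary n, one option that avoids touching the finder classes is to count things yourself. This is a rough sketch in plain Python, not NLTK's API: ngram_pmi and nbest are hypothetical helper names, and the score is the usual n-gram generalization of PMI, log2(count(ngram) * N^(n-1) / product of the unigram counts), with N the number of tokens.

```python
import math
from collections import Counter

def ngram_pmi(words, n):
    """Return {ngram: PMI score} for every run of n consecutive words."""
    words = list(words)
    total = len(words)
    unigram_counts = Counter(words)
    ngram_counts = Counter(
        tuple(words[i:i + n]) for i in range(total - n + 1)
    )
    scores = {}
    for gram, count in ngram_counts.items():
        numerator = count * total ** (n - 1)
        denominator = math.prod(unigram_counts[w] for w in gram)
        scores[gram] = math.log2(numerator / denominator)
    return scores

def nbest(scores, k):
    """Return the k highest-scoring n-grams (ties broken alphabetically)."""
    return sorted(scores, key=lambda gram: (-scores[gram], gram))[:k]

# Toy input (illustrative only).
words = ["a", "b", "c", "d", "a", "b", "c", "d"]
scores = ngram_pmi(words, 4)
print(nbest(scores, 2))
```

It only implements one measure, and it ignores the wildcard tables that NLTK keeps, so measures that need a full contingency table are out of reach with this approach.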

So you have to ask: what is the ultimate purpose of finding the collocations?

If you're looking at retrieving words within collocation windows larger than 2- or 3-grams, you end up with a lot of noise in your word retrieval.
If you're going to build a model based on a collocation model using 2- or 3-gram windows, you will also face sparsity problems.
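To get a feel for the sparsity issue, here is a small sketch using the bigram finder on a made-up word list. PMI tends to favor pairs that occur only once, and apply_freq_filter() (a method the finders inherit from AbstractCollocationFinder) is one way to drop such rare n-grams before ranking. The threshold of 2 is an arbitrary choice for this toy data.

```python
from nltk.collocations import BigramCollocationFinder
from nltk.metrics.association import BigramAssocMeasures

# Toy token stream (illustrative only).
words = "new york new york new york is large".split()
bigram_measures = BigramAssocMeasures()

# Without filtering, a pair seen only once gets the top PMI score.
finder = BigramCollocationFinder.from_words(words)
top_unfiltered = finder.nbest(bigram_measures.pmi, 1)

# Ignore bigrams that occur fewer than 2 times, then rank again.
finder.apply_freq_filter(2)
top_filtered = finder.nbest(bigram_measures.pmi, 1)

print(top_unfiltered, top_filtered)
```

With longer n-grams most candidates occur only once, so a filter like this will throw away almost everything unless the corpus is large.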

answered Oct 27 '22 13:10

alvas
