This is a Python and NLTK newbie question.
I want to find the frequency of bigrams which occur together more than 10 times and have the highest PMI.
For this, I am working with this code:
import nltk
from nltk.collocations import BigramCollocationFinder

def get_list_phrases(text):
    tweet_phrases = []
    for tweet in text:
        tweet_words = tweet.split()
        tweet_phrases.extend(tweet_words)

    bigram_measures = nltk.collocations.BigramAssocMeasures()
    finder = BigramCollocationFinder.from_words(tweet_phrases, window_size=13)
    finder.apply_freq_filter(10)
    finder.nbest(bigram_measures.pmi, 20)

    for k, v in finder.ngram_fd.items():
        print(k, v)
However, this does not restrict the results to the top 20, and I see results which have frequency < 10. I am new to the world of Python.
Can someone please point out how to modify this to get only the top 20?
Thank You
The problem is with the way you are trying to use apply_freq_filter.
We are discussing word collocations. As you know, a word collocation is about dependency between words. The BigramCollocationFinder class inherits from a class named AbstractCollocationFinder, and the function apply_freq_filter belongs to that class. apply_freq_filter is not supposed to totally delete some word collocations, but to provide a filtered list of collocations whenever some other function tries to access the list.
Now why is that? Imagine that if filtering collocations simply deleted them, then many probability measures such as likelihood ratio or PMI itself (which compute the probability of a word relative to other words in a corpus) would not function properly after deleting words from random positions in the given corpus. By deleting some collocations from the given list of words, many potential functionalities and computations would be disabled. Also, computing all of these measures before the deletion would bring a massive computational overhead which the user might not need after all.
Now, the question is how to correctly use apply_freq_filter? There are a few ways. In the following I will show the problem and its solution.
Let's define a sample corpus and split it into a list of words, similar to what you have done:
tweet_phrases = "I love iphone . I am so in love with iphone . iphone is great . samsung is great . iphone sucks. I really really love iphone cases. samsung can never beat iphone . samsung is better than apple"
from nltk.collocations import *
import nltk
For the purpose of experimenting I set the window size to 3:
finder = BigramCollocationFinder.from_words(tweet_phrases.split(), window_size=3)
finder1 = BigramCollocationFinder.from_words(tweet_phrases.split(), window_size=3)
Notice that for the sake of comparison I only use the filter on finder1:
finder1.apply_freq_filter(2)
bigram_measures = nltk.collocations.BigramAssocMeasures()
Now if I write:
for k,v in finder.ngram_fd.items():
print(k,v)
The output is:
(('.', 'is'), 3)
(('iphone', '.'), 3)
(('love', 'iphone'), 3)
(('.', 'iphone'), 2)
(('.', 'samsung'), 2)
(('great', '.'), 2)
(('iphone', 'I'), 2)
(('iphone', 'samsung'), 2)
(('is', '.'), 2)
(('is', 'great'), 2)
(('samsung', 'is'), 2)
(('.', 'I'), 1)
(('.', 'am'), 1)
(('.', 'sucks.'), 1)
(('I', 'am'), 1)
(('I', 'iphone'), 1)
(('I', 'love'), 1)
(('I', 'really'), 1)
(('I', 'so'), 1)
(('am', 'in'), 1)
(('am', 'so'), 1)
(('beat', '.'), 1)
(('beat', 'iphone'), 1)
(('better', 'apple'), 1)
(('better', 'than'), 1)
(('can', 'beat'), 1)
(('can', 'never'), 1)
(('cases.', 'can'), 1)
(('cases.', 'samsung'), 1)
(('great', 'iphone'), 1)
(('great', 'samsung'), 1)
(('in', 'love'), 1)
(('in', 'with'), 1)
(('iphone', 'cases.'), 1)
(('iphone', 'great'), 1)
(('iphone', 'is'), 1)
(('iphone', 'sucks.'), 1)
(('is', 'better'), 1)
(('is', 'than'), 1)
(('love', '.'), 1)
(('love', 'cases.'), 1)
(('love', 'with'), 1)
(('never', 'beat'), 1)
(('never', 'iphone'), 1)
(('really', 'iphone'), 1)
(('really', 'love'), 1)
(('samsung', 'better'), 1)
(('samsung', 'can'), 1)
(('samsung', 'great'), 1)
(('samsung', 'never'), 1)
(('so', 'in'), 1)
(('so', 'love'), 1)
(('sucks.', 'I'), 1)
(('sucks.', 'really'), 1)
(('than', 'apple'), 1)
(('with', '.'), 1)
(('with', 'iphone'), 1)
I will get the same result if I write the same for finder1. So, at first glance the filter doesn't work. However, see how it has actually worked: the trick is to use score_ngrams.
If I use score_ngrams on finder, it would be:
finder.score_ngrams(bigram_measures.pmi)
and the output is:
[(('am', 'in'), 5.285402218862249), (('am', 'so'), 5.285402218862249), (('better', 'apple'), 5.285402218862249), (('better', 'than'), 5.285402218862249), (('can', 'beat'), 5.285402218862249), (('can', 'never'), 5.285402218862249), (('cases.', 'can'), 5.285402218862249), (('in', 'with'), 5.285402218862249), (('never', 'beat'), 5.285402218862249), (('so', 'in'), 5.285402218862249), (('than', 'apple'), 5.285402218862249), (('sucks.', 'really'), 4.285402218862249), (('is', 'great'), 3.7004397181410926), (('I', 'am'), 3.7004397181410926), (('I', 'so'), 3.7004397181410926), (('cases.', 'samsung'), 3.7004397181410926), (('in', 'love'), 3.7004397181410926), (('is', 'better'), 3.7004397181410926), (('is', 'than'), 3.7004397181410926), (('love', 'cases.'), 3.7004397181410926), (('love', 'with'), 3.7004397181410926), (('samsung', 'better'), 3.7004397181410926), (('samsung', 'can'), 3.7004397181410926), (('samsung', 'never'), 3.7004397181410926), (('so', 'love'), 3.7004397181410926), (('sucks.', 'I'), 3.7004397181410926), (('samsung', 'is'), 3.115477217419936), (('.', 'am'), 2.9634741239748865), (('.', 'sucks.'), 2.9634741239748865), (('beat', '.'), 2.9634741239748865), (('with', '.'), 2.9634741239748865), (('.', 'is'), 2.963474123974886), (('great', '.'), 2.963474123974886), (('love', 'iphone'), 2.7004397181410926), (('I', 'really'), 2.7004397181410926), (('beat', 'iphone'), 2.7004397181410926), (('great', 'samsung'), 2.7004397181410926), (('iphone', 'cases.'), 2.7004397181410926), (('iphone', 'sucks.'), 2.7004397181410926), (('never', 'iphone'), 2.7004397181410926), (('really', 'love'), 2.7004397181410926), (('samsung', 'great'), 2.7004397181410926), (('with', 'iphone'), 2.7004397181410926), (('.', 'samsung'), 2.37851162325373), (('is', '.'), 2.37851162325373), (('iphone', 'I'), 2.1154772174199366), (('iphone', 'samsung'), 2.1154772174199366), (('I', 'love'), 2.115477217419936), (('iphone', '.'), 1.963474123974886), (('great', 'iphone'), 1.7004397181410922), (('iphone', 'great'), 1.7004397181410922), (('really', 'iphone'), 1.7004397181410922), (('.', 'iphone'), 1.37851162325373), (('.', 'I'), 1.37851162325373), (('love', '.'), 1.37851162325373), (('I', 'iphone'), 1.1154772174199366), (('iphone', 'is'), 1.1154772174199366)]
Now notice what happens when I compute the same for finder1, which was filtered to a frequency of 2:
finder1.score_ngrams(bigram_measures.pmi)
and the output:
[(('is', 'great'), 3.7004397181410926), (('samsung', 'is'), 3.115477217419936), (('.', 'is'), 2.963474123974886), (('great', '.'), 2.963474123974886), (('love', 'iphone'), 2.7004397181410926), (('.', 'samsung'), 2.37851162325373), (('is', '.'), 2.37851162325373), (('iphone', 'I'), 2.1154772174199366), (('iphone', 'samsung'), 2.1154772174199366), (('iphone', '.'), 1.963474123974886), (('.', 'iphone'), 1.37851162325373)]
Notice that all the collocations with a frequency of less than 2 are missing from this list; it's exactly the result you were looking for. So the filter has worked. Also, the documentation gives only a minimal hint about this issue.
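Coming back to your original code: the remaining issue is that you call nbest but discard its return value, and then iterate over ngram_fd, which is why you still see more than 20 results and pairs with frequency < 10. Here is a minimal sketch of the fix (the function name top_bigrams and the tweets argument are just placeholders of mine; I leave out window_size here, see the disclaimer below):

import nltk
from nltk.collocations import BigramCollocationFinder

def top_bigrams(tweets, min_freq=10, n=20):
    # Flatten the tweets into a single list of tokens, as in your function.
    words = []
    for tweet in tweets:
        words.extend(tweet.split())

    bigram_measures = nltk.collocations.BigramAssocMeasures()
    finder = BigramCollocationFinder.from_words(words)
    finder.apply_freq_filter(min_freq)           # hide pairs seen fewer than min_freq times
    best = finder.nbest(bigram_measures.pmi, n)  # top n by PMI among the remaining pairs
    # The surviving pairs keep their raw counts, so the frequencies can be reported too.
    return [(pair, finder.ngram_fd[pair]) for pair in best]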
I hope this has answered your question. Otherwise, please let me know.
Disclaimer: If you are primarily dealing with tweets, a window size of 13 is way too big. As you may have noticed, my sample tweets are so short that a window size of 13 would produce collocations that are irrelevant.