
Python: How to compute the top X most frequently used words in an NLTK corpus? [duplicate]

Tags: python, nltk

I'm unsure whether I've understood correctly how FreqDist works in Python. The tutorial I am following leads me to believe that the following code constructs a frequency distribution for a given list of words and returns the top x most frequently used words. (In the example below, let corpus be an NLTK corpus and file.txt be the name of a file in that corpus.)

words = corpus.words('file.txt')
fd_words = nltk.FreqDist(word.lower() for word in words)
fd_words.items()[:x]

However, when I run the following commands in Python, the output seems to suggest otherwise:

>>> from nltk import *
>>> fdist = FreqDist(['hi','my','name','is','my','name'])
>>> fdist
FreqDist({'my': 2, 'name': 2, 'is': 1, 'hi': 1})
>>> fdist.items()
[('is',1),('hi',1),('my',2),('name',2)]
>>> fdist.items()[:2]
[('is',1),('hi',1)]

Is fdist.items()[:x] in fact returning the x least common words?

Can someone tell me if I have done something wrong or if the mistake lies in the tutorial I am following?

Wolff asked Jan 29 '16 14:01



1 Answer

By default, a FreqDist is not sorted, so slicing items() does not give you the most frequent entries. I think you are looking for the most_common method:

from nltk import FreqDist
fdist = FreqDist(['hi','my','name','is','my','name'])
fdist.most_common(2)

Returns:

[('my', 2), ('name', 2)]
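
Applied to the corpus case from the question, this could look roughly like the sketch below. The helper name top_words and the sample list are made up for illustration. The fallback import is only there so the snippet also runs without NLTK installed; it relies on FreqDist being a subclass of collections.Counter in NLTK 3.

```python
try:
    from nltk import FreqDist  # NLTK 3.x
except ImportError:
    # Fallback so the sketch runs without NLTK: in NLTK 3, FreqDist
    # subclasses collections.Counter, which provides most_common.
    from collections import Counter as FreqDist


def top_words(words, x):
    """Return the x most frequent lowercased words as (word, count) pairs.

    top_words is a hypothetical helper name, not part of NLTK.
    """
    fd_words = FreqDist(word.lower() for word in words)
    # most_common sorts by count, unlike slicing items().
    return fd_words.most_common(x)


# Hypothetical stand-in for corpus.words('file.txt').
sample = ['Hi', 'my', 'name', 'is', 'My', 'name']
print(top_words(sample, 2))
```

With a real corpus you would pass corpus.words('file.txt') in place of sample. Words with equal counts may come back in either order, so don't rely on how ties are arranged.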
Jerzy Pawlikowski answered Dec 04 '22 23:12