
FreqDist in NLTK not sorting output

Tags:

python

nlp

nltk

I'm new to Python and I'm trying to teach myself language processing. NLTK has a class called FreqDist that gives the frequency of each word in a text, but for some reason it's not working properly.

This is what the tutorial has me write:

fdist1 = FreqDist(text1)
vocabulary1 = fdist1.keys()
vocabulary1[:50]

So basically it's supposed to give me a list of the 50 most frequent words in the text. When I run the code, though, the result is the 50 least frequent words in order of least frequent to most frequent, as opposed to the other way around. The output I am getting is as follows:

[u'succour', u'four', u'woods', u'hanging', u'woody', u'conjure', u'looking', u'eligible', u'scold', u'unsuitableness', u'meadows', u'stipulate', u'leisurely', u'bringing', u'disturb', u'internally', u'hostess', u'mohrs', u'persisted', u'Does', u'succession', u'tired', u'cordially', u'pulse', u'elegant', u'second', u'sooth', u'shrugging', u'abundantly', u'errors', u'forgetting', u'contributed', u'fingers', u'increasing', u'exclamations', u'hero', u'leaning', u'Truth', u'here', u'china', u'hers', u'natured', u'substance', u'unwillingness...]

I'm copying the tutorial exactly, but I must be doing something wrong.

Here is the link to the tutorial:

http://www.nltk.org/book/ch01.html#sec-computing-with-language-texts-and-words

The example is right under the heading "Figure 1.3: Counting Words Appearing in a Text (a frequency distribution)"

Does anyone know how I might fix this?

user3528925 asked Apr 13 '14 12:04


1 Answer

From NLTK's GitHub:

FreqDist in NLTK3 is a wrapper for collections.Counter; Counter provides most_common() method to return items in order. FreqDist.keys() method is provided by standard library; it is not overridden. I think it is good we're becoming more compatible with stdlib.

docs at googlecode are very old, they are from 2011. More up-to-date docs can be found on http://nltk.org website.

So for NLTK version 3, instead of fdist1.keys()[:50], use fdist1.most_common(50).
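
To see the difference without loading a corpus, here is a minimal sketch that uses collections.Counter directly as a stand-in for FreqDist (which the quote above describes as a wrapper around it). The token list is made up for illustration; with NLTK installed, the same calls should work on a FreqDist built from text1.

```python
from collections import Counter

# Hypothetical toy token list standing in for text1
tokens = "the whale and the sea and the ship".split()

# Counter is used here as a stand-in for nltk.FreqDist
fdist = Counter(tokens)

# keys() follows the underlying dict's ordering, not frequency.
# On Python 3 it is a view, so wrap it in list() before slicing.
unordered = list(fdist.keys())

# most_common(n) returns (sample, count) pairs, highest count first
top_two = fdist.most_common(2)

# If you only want the words, as in the old tutorial's vocabulary1 list
top_words = [word for word, _ in fdist.most_common(2)]

print(top_two)
print(top_words)
```

Note that most_common() returns (word, count) tuples rather than bare words, which is why the last step unpacks each pair.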

The tutorial has also been updated:

>>> fdist1 = FreqDist(text1)
>>> print(fdist1)
<FreqDist with 19317 samples and 260819 outcomes>
>>> fdist1.most_common(50)
[(',', 18713), ('the', 13721), ('.', 6862), ('of', 6536), ('and', 6024),
('a', 4569), ('to', 4542), (';', 4072), ('in', 3916), ('that', 2982),
("'", 2684), ('-', 2552), ('his', 2459), ('it', 2209), ('I', 2124),
('s', 1739), ('is', 1695), ('he', 1661), ('with', 1659), ('was', 1632),
('as', 1620), ('"', 1478), ('all', 1462), ('for', 1414), ('this', 1280),
('!', 1269), ('at', 1231), ('by', 1137), ('but', 1113), ('not', 1103),
('--', 1070), ('him', 1058), ('from', 1052), ('be', 1030), ('on', 1005),
('so', 918), ('whale', 906), ('one', 889), ('you', 841), ('had', 767),
('have', 760), ('there', 715), ('But', 705), ('or', 697), ('were', 680),
('now', 646), ('which', 640), ('?', 637), ('me', 627), ('like', 624)]
>>> fdist1['whale']
906
Hugo answered Oct 05 '22 23:10