I'm new to Python and I'm trying to teach myself language processing. NLTK in Python has a class called FreqDist that gives the frequency of words in a text, but for some reason it's not working properly.
This is what the tutorial has me write:
fdist1 = FreqDist(text1)
vocabulary1 = fdist1.keys()
vocabulary1[:50]
So basically it's supposed to give me a list of the 50 most frequent words in the text. When I run the code, though, the result is the 50 least frequent words, ordered from least frequent to most frequent, rather than the other way around. The output I am getting is as follows:
[u'succour', u'four', u'woods', u'hanging', u'woody', u'conjure', u'looking', u'eligible', u'scold', u'unsuitableness', u'meadows', u'stipulate', u'leisurely', u'bringing', u'disturb', u'internally', u'hostess', u'mohrs', u'persisted', u'Does', u'succession', u'tired', u'cordially', u'pulse', u'elegant', u'second', u'sooth', u'shrugging', u'abundantly', u'errors', u'forgetting', u'contributed', u'fingers', u'increasing', u'exclamations', u'hero', u'leaning', u'Truth', u'here', u'china', u'hers', u'natured', u'substance', u'unwillingness...]
I'm copying the tutorial exactly, but I must be doing something wrong.
Here is the link to the tutorial:
http://www.nltk.org/book/ch01.html#sec-computing-with-language-texts-and-words
The example is right under the heading "Figure 1.3: Counting Words Appearing in a Text (a frequency distribution)"
Does anyone know how I might fix this?
From NLTK's GitHub:
FreqDist in NLTK3 is a wrapper for collections.Counter; Counter provides a most_common() method to return items in order. The FreqDist.keys() method is provided by the standard library; it is not overridden. I think it is good we're becoming more compatible with stdlib.
The docs at googlecode are very old; they are from 2011. More up-to-date docs can be found on the http://nltk.org website.
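To see why, here is a minimal sketch using plain collections.Counter, the class FreqDist wraps in NLTK 3: key order just reflects how the dictionary stores its entries, while most_common() actually sorts by count.
>>> from collections import Counter
>>> counts = Counter(['a', 'b', 'a', 'c', 'a', 'b'])
>>> list(counts.keys())    # dictionary key order, not frequency order
['a', 'b', 'c']
>>> counts.most_common(2)  # sorted by count, most frequent first
[('a', 3), ('b', 2)]
FreqDist inherits the same behavior, which is why fdist1.keys() no longer returns words sorted by frequency.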
So for NLTK version 3, instead of fdist1.keys()[:50], use fdist1.most_common(50).
The tutorial has also been updated:
>>> fdist1 = FreqDist(text1)
>>> print(fdist1)
<FreqDist with 19317 samples and 260819 outcomes>
>>> fdist1.most_common(50)
[(',', 18713), ('the', 13721), ('.', 6862), ('of', 6536), ('and', 6024),
('a', 4569), ('to', 4542), (';', 4072), ('in', 3916), ('that', 2982),
("'", 2684), ('-', 2552), ('his', 2459), ('it', 2209), ('I', 2124),
('s', 1739), ('is', 1695), ('he', 1661), ('with', 1659), ('was', 1632),
('as', 1620), ('"', 1478), ('all', 1462), ('for', 1414), ('this', 1280),
('!', 1269), ('at', 1231), ('by', 1137), ('but', 1113), ('not', 1103),
('--', 1070), ('him', 1058), ('from', 1052), ('be', 1030), ('on', 1005),
('so', 918), ('whale', 906), ('one', 889), ('you', 841), ('had', 767),
('have', 760), ('there', 715), ('But', 705), ('or', 697), ('were', 680),
('now', 646), ('which', 640), ('?', 637), ('me', 627), ('like', 624)]
>>> fdist1['whale']
906
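If you still want a bare list of the 50 most frequent words, as the old keys()-based code produced, you can strip the counts from the (word, count) pairs that most_common() returns:
>>> vocabulary1 = [word for word, count in fdist1.most_common(50)]
>>> vocabulary1[:10]
[',', 'the', '.', 'of', 'and', 'a', 'to', ';', 'in', 'that']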