Python: How to compute the top X most frequently used words in an NLTK corpus? [duplicate]

Tags:

python

nltk

I'm unsure whether I've understood correctly how the FreqDist function works in Python. I am following a tutorial, which leads me to believe that the following code constructs a frequency distribution for a given list of words and calculates the top x most frequently used words. (In the example below, let corpus be an NLTK corpus and file.txt be the name of a file in that corpus.)

words = corpus.words('file.txt')
fd_words = nltk.FreqDist(word.lower() for word in words)
fd_words.items()[:x]

However, when I run the following commands in Python, the output suggests otherwise:

>>> from nltk import *
>>> fdist = FreqDist(['hi','my','name','is','my','name'])
>>> fdist
FreqDist({'my': 2, 'name': 2, 'is': 1, 'hi': 1})
>>> fdist.items()
[('is',1),('hi',1),('my',2),('name',2)]
>>> fdist.items()[:2]
[('is',1),('hi',1)]

Is fdist.items()[:x] in fact returning the x least common words?

Can someone tell me if I have done something wrong or if the mistake lies in the tutorial I am following?

516

asked Jan 29 '16 14:01

Wolff

1 Answer

By default, a FreqDist is not sorted. I think you are looking for the most_common method:

from nltk import FreqDist
fdist = FreqDist(['hi','my','name','is','my','name'])
fdist.most_common(2)

Returns:

[('my', 2), ('name', 2)]
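
Applied to the snippet from the question, this might look like the sketch below. It assumes NLTK 3 or later, where FreqDist subclasses collections.Counter (which is where most_common comes from). The helper name top_words and the sample word list are made up for illustration; in the question's setting the list would come from corpus.words('file.txt').

```python
# Assumption: NLTK 3 or later, where FreqDist subclasses collections.Counter.
# The fallback lets this sketch run even where NLTK is not installed.
try:
    from nltk import FreqDist
except ImportError:
    from collections import Counter as FreqDist


def top_words(words, x):
    """Return the x most frequent lowercased words as (word, count) pairs."""
    # Hypothetical helper: counts case-insensitively, as in the tutorial code.
    fd_words = FreqDist(word.lower() for word in words)
    # most_common sorts by count, unlike slicing items(), which has no
    # frequency ordering in NLTK 3.
    return fd_words.most_common(x)


# Stand-in for corpus.words('file.txt'); any iterable of strings works.
sample = ['The', 'the', 'cat', 'the', 'cat', 'sat']
print(top_words(sample, 2))
```

Note that on Python 3, items() returns a view, so fd_words.items()[:x] raises a TypeError instead of silently returning the wrong words.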

135

answered Dec 04 '22 23:12

Jerzy Pawlikowski
