
nltk function to count occurrences of certain words

Tags:

nltk

corpus

In the nltk book there is the question "Read in the texts of the State of the Union addresses, using the state_union corpus reader. Count occurrences of men, women, and people in each document. What has happened to the usage of these words over time?"

I thought I could use something like state_union('1945-Truman.txt').count('men'). However, there are over 60 texts in the State of the Union corpus, and I feel there has to be an easier way to get the counts of these words for each one than repeating this call for every text.


asked Mar 31 '14 13:03

user3481246

1 Answer

You can use the corpus reader's .words() method, which returns a list of strings (i.e. tokens/words):

>>> from nltk.corpus import brown
>>> brown.words()
[u'The', u'Fulton', u'County', u'Grand', u'Jury', ...]

Then use a Counter object to count the instances (see https://docs.python.org/2/library/collections.html#collections.Counter):

>>> from collections import Counter
>>> wordcounts = Counter(brown.words())

But note that Counter is case-sensitive, so 'the' and 'The' are counted separately unless you lowercase the tokens first:

>>> from nltk.corpus import brown
>>> from collections import Counter
>>> brown.words()
[u'The', u'Fulton', u'County', u'Grand', u'Jury', ...]
>>> wordcounts = Counter(brown.words())
>>> wordcounts['the']
62713
>>> wordcounts['The']
7258
>>> wordcounts_lower = Counter(i.lower() for i in brown.words())
>>> wordcounts_lower['The']
0
>>> wordcounts_lower['the']
69971
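
To get the counts for each State of the Union address without repeating the call by hand, the same idea can be applied per file. The following is a minimal sketch, not a definitive solution: it assumes the state_union corpus data is installed (e.g. via nltk.download('state_union')), and the helper names count_targets, count_per_document and main are made up for illustration. It relies only on the reader's fileids() and words(fileid) methods.

```python
from collections import Counter

# Words asked about in the NLTK book exercise.
TARGETS = ("men", "women", "people")


def count_targets(words, targets=TARGETS):
    """Return {target: count} for one document, ignoring case."""
    counts = Counter(w.lower() for w in words)
    # A Counter returns 0 for missing keys, so absent words are reported as 0.
    return {t: counts[t] for t in targets}


def count_per_document(corpus, targets=TARGETS):
    """Map each file id of an NLTK-style corpus reader to its target counts.

    `corpus` only needs fileids() and words(fileid), which NLTK's
    plaintext corpus readers such as state_union provide.
    """
    return {fid: count_targets(corpus.words(fid), targets)
            for fid in corpus.fileids()}


def main():
    # Imported here so the helpers above work without NLTK installed.
    from nltk.corpus import state_union

    # File ids start with the year (e.g. '1945-Truman.txt'),
    # so sorting them lists the addresses chronologically.
    for fid, counts in sorted(count_per_document(state_union).items()):
        print(fid, counts)
```

Calling main() prints one line per address, which makes it easy to see how the usage of each word changes over time. Note that this counts exact tokens only, so 'men' does not match 'women' or "men's".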


answered Sep 22 '22 00:09

alvas
