nltk function to count occurrences of certain words

Tags: nltk, corpus

In the nltk book there is the question "Read in the texts of the State of the Union addresses, using the state_union corpus reader. Count occurrences of men, women, and people in each document. What has happened to the usage of these words over time?"

I thought I could use a function like state_union('1945-Truman.txt').count('men'). However, there are over 60 texts in the State of the Union corpus, and I feel there has to be an easier way to get the counts of these words for each text than repeating this call over and over.

asked Mar 31 '14 13:03 by user3481246



1 Answer

You can use the .words() method of the corpus reader to return a list of strings (i.e. tokens/words):

>>> from nltk.corpus import brown
>>> brown.words()
[u'The', u'Fulton', u'County', u'Grand', u'Jury', ...]

Then use a Counter object to count the instances (see https://docs.python.org/2/library/collections.html#collections.Counter):

>>> from collections import Counter
>>> wordcounts = Counter(brown.words())

Note that Counter is case-sensitive, so lowercase the tokens first if you want case-insensitive counts:

>>> from nltk.corpus import brown
>>> from collections import Counter
>>> brown.words()
[u'The', u'Fulton', u'County', u'Grand', u'Jury', ...]
>>> wordcounts = Counter(brown.words())
>>> wordcounts['the']
62713
>>> wordcounts['The']
7258
>>> wordcounts_lower = Counter(i.lower() for i in brown.words())
>>> wordcounts_lower['The']
0
>>> wordcounts_lower['the']
69971
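
To get the per-document counts the question asks about, the same Counter idea can be applied once per file. The following is only a sketch under stated assumptions: count_targets, count_per_document, and the sample_documents token lists are hypothetical names and made-up data used for illustration, not part of nltk and not real corpus content.

```python
from collections import Counter

# Words the exercise asks about.
TARGETS = ("men", "women", "people")


def count_targets(tokens, targets=TARGETS):
    """Count case-insensitive occurrences of each target word in a token list."""
    counts = Counter(token.lower() for token in tokens)
    # A Counter returns 0 for keys it has never seen, so no KeyError here.
    return {target: counts[target] for target in targets}


def count_per_document(documents, targets=TARGETS):
    """documents maps a document name to an iterable of tokens."""
    return {
        name: count_targets(tokens, targets)
        for name, tokens in sorted(documents.items())
    }


# Tiny made-up token lists standing in for corpus files (not real data).
sample_documents = {
    "1945-Example.txt": ["Men", "and", "women", "of", "the", "nation", ",", "men"],
    "2005-Example.txt": ["The", "people", "and", "the", "People", "'s", "voice"],
}

for name, counts in count_per_document(sample_documents).items():
    print(name, counts)
# 1945-Example.txt {'men': 2, 'women': 1, 'people': 0}
# 2005-Example.txt {'men': 0, 'women': 0, 'people': 2}

# With the real corpus installed, the mapping could be built like this
# (assumes the state_union data has been downloaded):
# from nltk.corpus import state_union
# documents = {fid: state_union.words(fid) for fid in state_union.fileids()}
```

Because the counting is done on tokens rather than on raw text, "men" is not counted inside "women", which a substring count on a plain string would do. The file names in state_union start with the year, so sorting the results by name should also put them in chronological order.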
answered Sep 22 '22 00:09 by alvas