
Count verbs, nouns, and other parts of speech with python's NLTK

I have multiple texts and I would like to create profiles of them based on their usage of various parts of speech, like nouns and verbs. Basically, I need to count how many times each part of speech is used.

I have tagged the text but am not sure how to go further:

tokens = nltk.word_tokenize(text.lower())
text = nltk.Text(tokens)
tags = nltk.pos_tag(text)

How can I save the counts for each part of speech into a variable?

Zach asked May 20 '12 15:05




1 Answer

The pos_tag function returns a list of (token, tag) pairs:

tagged = [('the', 'DT'), ('dog', 'NN'), ('sees', 'VB'), ('the', 'DT'), ('cat', 'NN')] 

If you are using Python 2.7 or later, then you can do it simply with:

>>> from collections import Counter
>>> counts = Counter(tag for word,tag in tagged)
>>> counts
Counter({'DT': 2, 'NN': 2, 'VB': 1})
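
If you want totals for broad classes such as nouns and verbs rather than for each individual tag, one option is to group tags by prefix. This is only a sketch: it assumes the tags are Penn Treebank tags (where noun tags start with NN, verb tags with VB, adjective tags with JJ and adverb tags with RB), and the names COARSE_PREFIXES and coarse_counts are made up for illustration, not part of NLTK.

```python
from collections import Counter

# Hypothetical mapping from Penn Treebank tag prefixes to coarse classes.
COARSE_PREFIXES = {"NN": "noun", "VB": "verb", "JJ": "adjective", "RB": "adverb"}

def coarse_counts(tagged):
    """Count coarse parts of speech in a list of (token, tag) pairs."""
    counts = Counter()
    for _token, tag in tagged:
        # Look up the first two characters of the tag; unmapped tags become 'other'.
        counts[COARSE_PREFIXES.get(tag[:2], "other")] += 1
    return counts

# Hand-written sample data in the same shape as the pos_tag output.
sample = [('the', 'DT'), ('dog', 'NN'), ('sees', 'VBZ'), ('the', 'DT'), ('cats', 'NNS')]
sample_counts = coarse_counts(sample)
print(sample_counts["noun"], sample_counts["verb"], sample_counts["other"])
```

Anything not covered by the mapping (determiners, pronouns, punctuation and so on) ends up under 'other', so extend the mapping if you need those classes separately.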

To normalize the counts (giving you the proportion of each) do:

>>> total = sum(counts.values())
>>> dict((tag, float(count)/total) for tag,count in counts.items())
{'DT': 0.4, 'VB': 0.2, 'NN': 0.4}
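
Since the question mentions several texts, the counting and normalizing steps can be wrapped in a function and applied to each text in turn. The following is a rough sketch assuming Python 3 (so that / is true division) and texts that have already been tagged; pos_profile, tagged_texts and the text names are hypothetical.

```python
from collections import Counter

def pos_profile(tagged):
    """Return {tag: proportion} for one list of (token, tag) pairs."""
    counts = Counter(tag for _token, tag in tagged)
    total = sum(counts.values())
    if total == 0:
        # An empty text has no meaningful proportions.
        return {}
    return {tag: count / total for tag, count in counts.items()}

# Hypothetical pre-tagged texts, keyed by made-up names.
tagged_texts = {
    "text_a": [('the', 'DT'), ('dog', 'NN'), ('sees', 'VB'), ('the', 'DT'), ('cat', 'NN')],
    "text_b": [('dogs', 'NNS'), ('bark', 'VBP')],
}

# One profile per text, ready to compare across texts.
profiles = {name: pos_profile(tagged) for name, tagged in tagged_texts.items()}
print(profiles["text_a"])
```

Working with proportions rather than raw counts keeps texts of different lengths comparable.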

Note that Counter was only added in Python 2.7, so in older versions you'll have to do the tallying yourself, for example with a defaultdict:

>>> from collections import defaultdict
>>> counts = defaultdict(int)
>>> for word, tag in tagged:
...     counts[tag] += 1
...
>>> counts
defaultdict(<type 'int'>, {'DT': 2, 'VB': 1, 'NN': 2})

dhg answered Sep 21 '22 14:09