I have multiple texts and I would like to create profiles of them based on their usage of various parts of speech, like nouns and verbs. Basically, I need to count how many times each part of speech is used.
I have tagged the text but am not sure how to proceed:
import nltk

tokens = nltk.word_tokenize(text.lower())  # `text` is the raw input string
tags = nltk.pos_tag(tokens)  # wrapping tokens in nltk.Text is unnecessary for tagging
How can I save the counts for each part of speech into a variable?
The Natural Language Toolkit (NLTK) is a platform for building Python programs that analyze text. One of the more powerful features of the NLTK module is its part-of-speech tagging. To run the Python program below, you must have NLTK installed.
The universal tagset in NLTK comprises 12 tag classes: verbs, nouns, pronouns, adjectives, adverbs, adpositions, conjunctions, determiners, cardinal numbers, particles, other/foreign words, and punctuation.
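If you want your counts at that coarse level rather than the detailed Penn Treebank tags, pos_tag accepts a tagset argument (a sketch; it assumes you have downloaded NLTK's universal_tagset data, and the exact tags can vary with the tagger model):
>>> nltk.pos_tag(nltk.word_tokenize('the dog sees the cat'), tagset='universal')
[('the', 'DET'), ('dog', 'NOUN'), ('sees', 'VERB'), ('the', 'DET'), ('cat', 'NOUN')]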
The pos_tag function gives you back a list of (token, tag) pairs:
tagged = [('the', 'DT'), ('dog', 'NN'), ('sees', 'VB'), ('the', 'DT'), ('cat', 'NN')]
If you are using Python 2.7 or later, then you can do it simply with:
>>> from collections import Counter
>>> counts = Counter(tag for word,tag in tagged)
>>> counts
Counter({'DT': 2, 'NN': 2, 'VB': 1})
To normalize the counts (giving you the proportion of each tag), do:
>>> total = sum(counts.values())
>>> dict((tag, float(count)/total) for tag, count in counts.items())
{'DT': 0.4, 'VB': 0.2, 'NN': 0.4}
Note that in older versions of Python, you'll have to implement Counter yourself:
>>> from collections import defaultdict
>>> counts = defaultdict(int)
>>> for word, tag in tagged:
...     counts[tag] += 1
>>> counts
defaultdict(<type 'int'>, {'DT': 2, 'VB': 1, 'NN': 2})
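Putting it all together for your multiple-texts use case, here is a minimal Python 3 sketch (pos_profile and texts are illustrative names, not part of NLTK; it assumes the punkt, averaged_perceptron_tagger, and universal_tagset NLTK data packages have been downloaded):

import nltk
from collections import Counter

# Assumes NLTK data packages are installed, e.g. via
# nltk.download('punkt'), nltk.download('averaged_perceptron_tagger'),
# nltk.download('universal_tagset')

def pos_profile(text):
    """Return the proportion of each universal part of speech in `text`."""
    tokens = nltk.word_tokenize(text.lower())
    tags = [tag for _, tag in nltk.pos_tag(tokens, tagset='universal')]
    counts = Counter(tags)
    total = sum(counts.values())
    return {tag: count / total for tag, count in counts.items()}

texts = ["The dog sees the cat.", "Run, Spot, run!"]  # your corpus goes here
profiles = [pos_profile(t) for t in texts]

Each entry in profiles maps a tag to its share of that text, so the profiles are directly comparable across texts of different lengths.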