Chapter 5 of the Python NLTK book gives this example of tagging words in a sentence:
>>> text = nltk.word_tokenize("And now for something completely different")
>>> nltk.pos_tag(text)
[('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'), ('completely', 'RB'), ('different', 'JJ')]
nltk.pos_tag calls the default tagger, which uses a full set of tags. Later in the chapter a simplified set of tags is introduced.
How can I tag sentences with this simplified set of part-of-speech tags?
Also, have I understood the tagger correctly? That is, can I change the tag set that the tagger uses, should I map the tags it returns onto the simplified set, or should I create a new tagger from a new, simply-tagged corpus?
POS tagging (part-of-speech tagging) is the process of marking up each word in a text with its part of speech, based on both its definition and its context. A tagger reads text in a given language and assigns a part-of-speech tag to each word. It is also called grammatical tagging.
IN: preposition/subordinating conjunction
JJ: adjective ('big')
JJR: adjective, comparative ('bigger')
JJS: adjective, superlative ('biggest')
Updated, in case anyone runs across the same problem: NLTK has since upgraded to a "universal" tagset. Once you've tagged your text, use map_tag to simplify the tags.
import nltk
from nltk.tag import pos_tag, map_tag

text = nltk.word_tokenize("And now for something completely different")
posTagged = pos_tag(text)  # Penn Treebank tags, e.g. ('And', 'CC')
# Map each Penn Treebank tag ('en-ptb') to its universal equivalent
simplifiedTags = [(word, map_tag('en-ptb', 'universal', tag)) for word, tag in posTagged]
print(simplifiedTags)
# [('And', 'CONJ'), ('now', 'ADV'), ('for', 'ADP'), ('something', 'NOUN'), ('completely', 'ADV'), ('different', 'ADJ')]
# (Python 2 prints the tags with a u'' prefix, e.g. u'CONJ')
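To make the "map the tags it returns" option from the question concrete, here is a minimal sketch of that mapping step written as a plain dictionary lookup. The names PTB_TO_UNIVERSAL and simplify_tags are hypothetical, the table is a hand-written partial subset rather than NLTK's own mapping data, and the "X" fallback for unlisted tags is an assumption of this sketch. The tagged input is hard-coded from the question so the example runs without NLTK installed.

```python
# Hand-written, partial Penn Treebank -> universal mapping (illustrative only;
# NLTK ships the complete table and exposes it through map_tag).
PTB_TO_UNIVERSAL = {
    "CC": "CONJ",   # coordinating conjunction
    "IN": "ADP",    # preposition/subordinating conjunction
    "JJ": "ADJ",    # adjective
    "JJR": "ADJ",   # adjective, comparative
    "JJS": "ADJ",   # adjective, superlative
    "NN": "NOUN",   # noun, singular
    "RB": "ADV",    # adverb
}


def simplify_tags(tagged, fallback="X"):
    """Replace each fine-grained tag with its coarse tag.

    Tags missing from the table get `fallback` (an assumption of this sketch).
    """
    return [(word, PTB_TO_UNIVERSAL.get(tag, fallback)) for word, tag in tagged]


# Output of nltk.pos_tag as quoted in the question, hard-coded here.
pos_tagged = [
    ("And", "CC"), ("now", "RB"), ("for", "IN"),
    ("something", "NN"), ("completely", "RB"), ("different", "JJ"),
]

simplified = simplify_tags(pos_tagged)
print(simplified)
```

In practice map_tag is the better choice, since it covers the whole tag set. Recent NLTK versions also accept a tagset argument on the tagger itself, e.g. pos_tag(text, tagset='universal'), which should give the same coarse tags in one call.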