I am tagging Spanish text with the Stanford POS Tagger (via NLTK in Python).
Here is my code:
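import nltk
from nltk.tag.stanford import POSTagger
# Load the Spanish model and the tagger jar (paths relative to the working directory)
spanish_postagger = POSTagger('models/spanish.tagger', 'stanford-postagger.jar')
# The tagger expects a list of tokens, so split the sentence on whitespace
spanish_postagger.tag('esta es una oracion de prueba'.split())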
The result is:
[(u'esta', u'pd000000'),
(u'es', u'vsip000'),
(u'una', u'di0000'),
(u'oracion', u'nc0s000'),
(u'de', u'sp000'),
(u'prueba', u'nc0s000')]
Where can I find out what exactly pd000000, vsip000, di0000, nc0s000, and sp000 mean?
Part-of-speech (POS) tagging is a common natural language processing task: each word in a text (corpus) is assigned a part of speech based on its definition and its context.
In NLTK, a POS tagger assigns this grammatical information to each word of a sentence. Examples of tags produced by NLTK's English taggers are CC, CD, EX, JJ, MD, NNP, PDT, PRP$, and TO.
Parts of speech include nouns, verbs, adverbs, adjectives, pronouns, and conjunctions, along with their sub-categories. Most POS taggers are rule-based, stochastic, or transformation-based.
This is a simplified version of the tagset used in the AnCora treebank. You can find their tagset documentation here: https://web.archive.org/web/20160325024315/http://nlp.lsi.upc.edu/freeling/doc/tagsets/tagset-es.html
The "simplification" consists of nulling out many of the final fields which don't strictly belong in a part-of-speech tag. For example, our part-of-speech tagger will always give you null (0) values for the NER field of the original tagset (see EAGLES noun documentation).
In short: the fields in the POS tags produced by our tagger correspond exactly to AnCora POS fields, but a lot of those fields will be null. For most practical purposes you'll only need to look at the first 2–4 characters of the tag. The first character always indicates the broad POS category, and the second character indicates some kind of subtype.
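As a rough illustration of reading a tag by its leading characters, here is a minimal Python sketch. The `BROAD_CATEGORY` table and the `describe_tag` helper are hypothetical names, and the letter-to-category mapping is an assumption based on the EAGLES convention, so verify it against the tagset documentation linked above before relying on it.

```python
# Broad POS categories keyed by the first character of the tag.
# Assumption: these letters follow the EAGLES convention used by AnCora;
# the tagset documentation is the authoritative source.
BROAD_CATEGORY = {
    'a': 'adjective',
    'c': 'conjunction',
    'd': 'determiner',
    'f': 'punctuation',
    'i': 'interjection',
    'n': 'noun',
    'p': 'pronoun',
    'r': 'adverb',
    's': 'adposition',
    'v': 'verb',
    'w': 'date',
    'z': 'numeral',
}


def describe_tag(tag):
    """Split a tag into (broad category, subtype code, remaining fields)."""
    category = BROAD_CATEGORY.get(tag[:1], 'unknown')
    subtype = tag[1:2]   # second character: subtype within the category
    rest = tag[2:]       # remaining fields, many of which are null ('0')
    return category, subtype, rest


# Tagger output from the question, used here as sample input.
tagged = [
    ('esta', 'pd000000'),
    ('es', 'vsip000'),
    ('una', 'di0000'),
    ('oracion', 'nc0s000'),
    ('de', 'sp000'),
    ('prueba', 'nc0s000'),
]

for word, tag in tagged:
    print(word, tag, describe_tag(tag))
```

The sketch deliberately leaves the subtype as a raw letter, because its meaning depends on the broad category and is best looked up in the tagset documentation.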
We're in the process of writing some introductory documentation for using Spanish with CoreNLP (that means understanding these tags, and much else) right now. For the moment, you can find more information on the first page of our technical documentation.