
Meaning of Stanford Spanish POS Tagger tags

I am tagging Spanish text with the Stanford POS Tagger (via NLTK in Python).

Here is my code:

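import nltk
from nltk.tag.stanford import POSTagger
# Load the Spanish model and the tagger JAR (paths are relative to the working directory)
spanish_postagger = POSTagger('models/spanish.tagger', 'stanford-postagger.jar')
# tag() takes a list of tokens, so split the sentence on whitespace first
spanish_postagger.tag('esta es una oracion de prueba'.split())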

The result is:

[(u'esta', u'pd000000'),
(u'es', u'vsip000'),
(u'una', u'di0000'),
(u'oracion', u'nc0s000'),
(u'de', u'sp000'),
(u'prueba', u'nc0s000')]

Where can I find out exactly what pd000000, vsip000, di0000, nc0s000, and sp000 mean?

Pedro Muñoz asked Nov 20 '14 19:11




1 Answer

This is a simplified version of the tagset used in the AnCora treebank. You can find their tagset documentation here: https://web.archive.org/web/20160325024315/http://nlp.lsi.upc.edu/freeling/doc/tagsets/tagset-es.html

The "simplification" consists of nulling out many of the final fields which don't strictly belong in a part-of-speech tag. For example, our part-of-speech tagger will always give you null (0) values for the NER field of the original tagset (see EAGLES noun documentation).

In short: the fields in the POS tags produced by our tagger correspond exactly to AnCora POS fields, but a lot of those fields will be null. For most practical purposes you'll only need to look at the first 2–4 characters of the tag. The first character always indicates the broad POS category, and the second character indicates some kind of subtype.
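As a rough illustration of reading only the leading characters, here is a minimal sketch that splits a tag into its broad category, its subtype character, and the remaining fields. The `BROAD_CATEGORY` table and the `split_tag` helper are hypothetical names introduced for this example, and the category labels are an assumption based on the EAGLES-style conventions in the tagset documentation linked above, so check them against that page before relying on them.

```python
# Broad categories keyed by the first character of the tag.
# Assumption: these labels follow the EAGLES-style conventions described in
# the linked tagset documentation; they are not taken from the tagger itself.
BROAD_CATEGORY = {
    "a": "adjective",
    "c": "conjunction",
    "d": "determiner",
    "f": "punctuation",
    "i": "interjection",
    "n": "noun",
    "p": "pronoun",
    "r": "adverb",
    "s": "adposition",
    "v": "verb",
    "w": "date",
    "z": "numeral",
}


def split_tag(tag):
    """Return (broad category, subtype character, remaining fields)."""
    tag = tag.lower()
    category = BROAD_CATEGORY.get(tag[:1], "unknown")
    subtype = tag[1:2]  # meaning depends on the category; see the tagset docs
    rest = tag[2:]      # often mostly '0' (null) in the simplified tagset
    return category, subtype, rest


# Tagger output from the question, with the Python 2 u'' prefixes dropped.
tagged = [
    ("esta", "pd000000"),
    ("es", "vsip000"),
    ("una", "di0000"),
    ("oracion", "nc0s000"),
    ("de", "sp000"),
    ("prueba", "nc0s000"),
]

for word, tag in tagged:
    print(word, split_tag(tag))
```

The subtype character is returned as-is rather than decoded, because its meaning changes from one category to the next; the tagset documentation lists the values for each category.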


We're currently writing introductory documentation for using Spanish with CoreNLP, which will cover these tags and much else. For the moment, you can find more information on the first page of our technical documentation.

Jon Gauthier answered Nov 06 '22 16:11