
Why does the Penn Treebank POS tagset have a separate tag for the word 'to'?

Tags: nlp, pos-tagger

The Penn Treebank tagset has a separate tag, TO, for the word 'to', irrespective of whether it is used in the preposition sense (as in "I went to school") or the infinitive sense (as in "I want to eat"). What purpose does this serve from an overall NLP perspective? Tagging only the infinitival 'to' separately would make intuitive sense, but I don't see the logic behind combining an infinitive marker and a preposition under a single tag.

Thanks, and apologies if this doesn't fit the Stack Overflow guidelines.

Sagar Ahire asked Sep 29 '13 15:09



1 Answer

Different corpora provide different levels of granularity. Compare this, for instance, to the British National Corpus, which includes three different tags for "to".

I believe this is a property of the corpus tagging practice rather than something designed for a specific NLP performance purpose. It is plausible that it was simply a design decision in the POS Guidelines for the Penn Treebank Project. (Contacting the authors of this paper for further clarification.)

Without a single tag for "to", the tagset would have to label it as a preposition in some contexts and as an infinitive marker in others, and a human annotator would have to disambiguate the two roles every time. Tricky cases that require grammaticality judgments would cost extra annotation time, and given the size of the corpus, some of them would likely end up mistagged. The designers may have favored efficiency and consistency, judging either that the information gained from disambiguating "to" was not that large or that the resulting tagging errors would be too numerous.
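If a downstream task does need the distinction, it can often be approximated after tagging. The following is only a minimal sketch under stated assumptions: the input is already tagged with Penn Treebank tags (hand-tagged here, no tagger library is called), the helper `refine_to` and the labels `TO-INF` and `TO-PREP` are hypothetical names invented for illustration, and the rule (a TO followed by a base-form verb, optionally after adverbs, is infinitival) is a rough heuristic, not part of the Penn Treebank guidelines.

```python
# Hypothetical helper: split the Penn Treebank TO tag into two finer labels.
# "TO-INF" and "TO-PREP" are illustrative names, not part of any real tagset.

def refine_to(tagged):
    """Relabel each TO token using the tag of the next non-adverb token."""
    refined = []
    for i, (word, tag) in enumerate(tagged):
        if tag == "TO":
            # Skip adverbs so "to boldly go" is still treated as an infinitive.
            j = i + 1
            while j < len(tagged) and tagged[j][1] == "RB":
                j += 1
            next_tag = tagged[j][1] if j < len(tagged) else None
            # Heuristic assumption: a base-form verb (VB) signals the infinitive.
            tag = "TO-INF" if next_tag == "VB" else "TO-PREP"
        refined.append((word, tag))
    return refined


# Hand-tagged versions of the two sentences from the question.
went = [("I", "PRP"), ("went", "VBD"), ("to", "TO"), ("school", "NN")]
want = [("I", "PRP"), ("want", "VBP"), ("to", "TO"), ("eat", "VB")]

print(refine_to(went))
print(refine_to(want))
```

A rule this simple will misfire on harder cases (for example, when the upstream tagger mislabels the following word), which is the kind of ambiguity that would have cost human annotators time.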

arturomp answered Dec 21 '22 12:12