The Penn Treebank tagset has a separate tag TO for the word 'to', irrespective of whether it's used in the preposition sense (such as "I went to school") or the infinitive sense (such as "I want to eat"). What purpose does this serve from an overall NLP perspective? Just tagging the infinitival 'to' separately makes intuitive sense, but I don't see the logic behind combining an infinitive and a preposition in a single tag.
Thanks, and apologies if this doesn't fit the Stack Overflow guidelines.
What is a POS tag? A POS tag (or part-of-speech tag) is a label assigned to each token (word) in a text corpus to indicate its part of speech and often other grammatical categories as well, such as tense, number (plural/singular), and case.
The Penn Treebank tagset contains 36 POS tags and 12 other tags (for punctuation and currency symbols). ... set of syntactic tags and null elements used in the skeletal bracketing are given in Table 1.
In simple terms, POS tagging is the task of labelling each word in a sentence with its appropriate part of speech. Parts of speech include nouns, verbs, adverbs, adjectives, pronouns, conjunctions, and their sub-categories.
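To make the labelling concrete, here is a minimal sketch of the two sentences from the question as (token, tag) pairs. The tags are assigned by hand following the Penn Treebank tagset, not produced by a tagger, and the helper name `tags_for` is just an illustrative choice.

```python
# Hand-assigned Penn Treebank tags for the two example sentences.
# These are written out manually for illustration, not tagger output.
tagged_sentences = [
    [("I", "PRP"), ("went", "VBD"), ("to", "TO"), ("school", "NN")],
    [("I", "PRP"), ("want", "VBP"), ("to", "TO"), ("eat", "VB")],
]


def tags_for(word, sentences):
    """Collect every tag assigned to `word` (case-insensitive)."""
    return [
        tag
        for sentence in sentences
        for token, tag in sentence
        if token.lower() == word.lower()
    ]


print(tags_for("to", tagged_sentences))  # ['TO', 'TO']
```

Both occurrences of "to" receive the same TO tag, even though the first is a preposition and the second an infinitive marker.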
Different corpora provide different levels of granularity. Compare the Penn Treebank, for instance, with the British National Corpus, whose tagset includes three different tags for "to".
I believe this came about as a property of the corpus tagging practice rather than for any specific NLP performance purpose. It is easy to imagine that it was a design decision in the POS guidelines for the Penn Treebank Project. (Contacting the authors of this paper for further clarification.)
If the tagset did not have a single dedicated tag for the word "to", each occurrence would have to be tagged either as a preposition or with a different tag for "infinitive marker". For that to happen, a human annotator would have to disambiguate between the two roles of "to". Tricky cases (which require grammaticality judgments) would take extra annotator time and, given the size of the corpus, could lead to some mistagging. The tradeoff may have been resolved in favor of efficiency and correctness because the information gained from disambiguating "to" was estimated to be small, or because the potential tagging errors were estimated to be too many.
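As a rough illustration of why the lost information may be small, the two roles can often be recovered afterwards from the neighbouring tags. The sketch below is only a heuristic I am assuming for illustration, not something taken from the Penn Treebank guidelines: it treats "to" as an infinitive marker when the next non-adverb token is tagged VB (base-form verb), and as a preposition otherwise. The labels `TO_INF` and `TO_PREP` and the function name `split_to` are made up for this example.

```python
def split_to(tagged):
    """Relabel each TO as TO_INF or TO_PREP using the tags that follow it.

    Heuristic (an assumption, not an official rule): "to" is an infinitive
    marker if the next token, skipping adverbs (RB), is a base-form verb (VB).
    """
    relabelled = []
    for i, (token, tag) in enumerate(tagged):
        if tag == "TO":
            j = i + 1
            # Skip intervening adverbs, as in "to boldly go".
            while j < len(tagged) and tagged[j][1] == "RB":
                j += 1
            is_infinitive = j < len(tagged) and tagged[j][1] == "VB"
            tag = "TO_INF" if is_infinitive else "TO_PREP"
        relabelled.append((token, tag))
    return relabelled


# Hand-tagged example sentences (tags assigned manually for illustration).
went = [("I", "PRP"), ("went", "VBD"), ("to", "TO"), ("school", "NN")]
want = [("I", "PRP"), ("want", "VBP"), ("to", "TO"), ("eat", "VB")]

print(split_to(went)[2])  # ('to', 'TO_PREP')
print(split_to(want)[2])  # ('to', 'TO_INF')
```

A rule this simple will misfire on some of the tricky cases mentioned above, which is the same difficulty a human annotator would face.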