Why does the Penn Treebank POS tagset have a separate tag for the word 'to'?

Tags: nlp, pos-tagger

The Penn Treebank tagset has a single tag, TO, for the word 'to', irrespective of whether it is used as a preposition (as in "I went to school") or as the infinitive marker (as in "I want to eat"). What purpose does this serve from an overall NLP perspective? Tagging only the infinitival 'to' separately would make intuitive sense, but I don't see the logic behind combining the infinitive marker and the preposition under one tag.

Thanks, and apologies if this doesn't fit the Stack Overflow guidelines.


asked Sep 29 '13 15:09

Sagar Ahire

1 Answer

Different corpora provide different levels of granularity. Compare this, for instance, with the British National Corpus, which includes three different tags for 'to'.

I believe this is more likely a by-product of the corpus tagging practice than a choice made for a specific NLP performance purpose. It is plausible that it was simply a design decision in the POS Guidelines for the Penn Treebank Project. (I am contacting the authors of this paper for further clarification.)

For the tagset not to have a separate tag for the word "to", annotators would have had to tag "to" sometimes as a preposition and sometimes with a different tag for "infinitive marker." That means a human tagger would have had to disambiguate between the two roles of "to." Some tricky cases (which require grammaticality judgments) would take extra human time to disambiguate, and some would end up mistagged given the size of the corpus. The designers may have favored efficiency and correctness if the information gain (from the extra granularity of having "to" disambiguated) was estimated to be small, or if the potential tagging errors were estimated to be too many.
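To make the disambiguation step concrete, here is a minimal sketch of how the two roles could be split apart after the fact. Everything in it is an assumption for illustration: `split_to_tag` is a hypothetical helper rather than part of any tagging library, the label `TO-INF` is invented and is not a Penn Treebank tag, the input tags are written by hand, and the rule (TO directly followed by a base-form verb, VB, is the infinitive marker) is only a rough heuristic.

```python
def split_to_tag(tagged):
    """Relabel Penn Treebank TO tokens using the tag of the next token.

    Heuristic (an assumption, not the Treebank's method): TO directly
    followed by a base-form verb (VB) is treated as the infinitive
    marker; any other TO is treated as a preposition (IN).
    "TO-INF" is an invented label, not a Penn Treebank tag.
    """
    result = []
    for i, (word, tag) in enumerate(tagged):
        if tag == "TO":
            next_tag = tagged[i + 1][1] if i + 1 < len(tagged) else None
            tag = "TO-INF" if next_tag == "VB" else "IN"
        result.append((word, tag))
    return result


# Hand-written (word, tag) pairs in Penn Treebank style.
infinitive = [("I", "PRP"), ("want", "VBP"), ("to", "TO"), ("eat", "VB")]
preposition = [("I", "PRP"), ("went", "VBD"), ("to", "TO"), ("school", "NN")]

print(split_to_tag(infinitive))
print(split_to_tag(preposition))
```

The heuristic handles the two example sentences from the question, but it already fails on something like "to boldly go", where an adverb sits between "to" and the verb. Cases like that are the kind that would have cost annotators extra time and judgment.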


answered Dec 21 '22 12:12

arturomp
