Is POS tagging deterministic?

I have been trying to wrap my head around why this is happening, and I am hoping someone can shed some light on it. I am trying to tag the following text:

ae0.475      X  mod 
ae0.842      X  mod
ae0.842      X  mod 
ae0.775      X  mod 

using the following code:

import nltk

with open("test", "r") as f:
    for line in f:
        words = line.split()  # splits on any whitespace and drops empty strings
        tags = nltk.pos_tag(words)
        key = ' '.join(tag for _, tag in tags)
        print(words, " : ", key)

and am getting the following result:

['ae0.475', 'X', 'mod']  :  NN NNP NN
['ae0.842', 'X', 'mod']  :  -NONE- NNP NN
['ae0.842', 'X', 'mod']  :  -NONE- NNP NN
['ae0.775', 'X', 'mod']  :  NN NNP NN

And I don't get it. Does anyone know the reason for this inconsistency? I am not particularly concerned about the accuracy of the POS tagging, because I am only trying to extract some templates, but the tagger seems to assign different tags at different instances to words that look "almost" the same.

As a workaround, I replaced every digit with 1, which solved the problem:

['ae1.111', 'X', 'mod']  :  NN NNP NN
['ae1.111', 'X', 'mod']  :  NN NNP NN
['ae1.111', 'X', 'mod']  :  NN NNP NN
['ae1.111', 'X', 'mod']  :  NN NNP NN
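For reference, the digit replacement above can be done with a small regex pass before tagging (a minimal sketch; the function name is my own):

```python
import re

def normalize_numbers(token):
    """Replace every digit with '1' so that tokens differing only in
    their digits get identical, and therefore consistent, tags."""
    return re.sub(r'\d', '1', token)

words = ['ae0.475', 'X', 'mod']
print([normalize_numbers(w) for w in words])  # → ['ae1.111', 'X', 'mod']
```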

but I am curious why the tagger assigned different tags in the first case. Any suggestions?

asked Jun 30 '11 by Legend


2 Answers

The best explanation I could find comes from a note by someone who was not using the whole Brown corpus:

Note that words that the tagger has not seen before, such as decried, receive a tag of None.

So, I guess something that looks like ae1.111 must appear in the corpus file, while nothing like ae0.842 does. That's kind of weird, but it is the reasoning behind the -NONE- tag.

Edit: I got super-curious, downloaded the Brown corpus myself, and did a plain-text search inside it. The number 111 appears 34 times, while 842 appears only 4 times, and only in the middle of dollar amounts or as the last 3 digits of a year. 111 appears many times on its own as a page number, and 775 also appears once as a page number.

So, I'm going to make a conjecture, that because of Benford's Law, you will end up matching numbers that start with 1s, 2s, and 3s much more often than numbers that start with 8s or 9s, since these are more often the page numbers of a random page that would be cited in a book. I'd be really interested in finding out if that's true (but not interested enough to do it myself, of course!).
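The conjecture is easy to sanity-check in isolation: over the pages of a book cut off at an arbitrary length, leading 1s dominate leading 8s. A quick sketch (the 450-page cutoff is my own arbitrary choice, and this says nothing about the Brown corpus itself):

```python
from collections import Counter

# Leading digits of every page number in a hypothetical 450-page book.
# Pages starting with 1: 1, 10-19, 100-199 (111 pages).
# Pages starting with 8: 8, 80-89 (just 11 pages).
leading = Counter(str(page)[0] for page in range(1, 451))
print(leading['1'], leading['8'])  # → 111 11
```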

answered Sep 25 '22 by Chris Cunningham


It is "deterministic" in the sense that the same sentence will be tagged the same way by the same algorithm every time. But since your words aren't in nltk's data (in fact, they aren't even real words in real sentences), the tagger has to fall back on some algorithm to infer what the tags should be. That means the tagging can change when the words change (even if the only change is a different number, as in your case), and the resulting tags aren't going to make much sense anyway.
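To illustrate the distinction, here is a toy shape-based fallback tagger. This is emphatically not nltk's actual algorithm, just a sketch showing how any such heuristic is deterministic (the same token always gets the same tag) while still being sensitive to small changes in the token, such as a different digit shifting it onto a different rule:

```python
import re

# Ordered pattern -> tag rules, checked first match wins.
# Purely illustrative; real taggers use learned statistics.
RULES = [
    (r'^\d+\.\d+$', 'CD'),   # bare decimal number
    (r'.*\d.*',     'NN'),   # anything else containing a digit
    (r'^[A-Z]+$',   'NNP'),  # all-caps token
    (r'.*',         'NN'),   # default
]

def toy_tag(token):
    for pattern, tag in RULES:
        if re.match(pattern, token):
            return tag
    return 'NN'

print([toy_tag(w) for w in ['ae0.475', 'X', 'mod']])  # → ['NN', 'NNP', 'NN']
```

Calling `toy_tag` twice on the same token always yields the same tag, yet the tags carry no real linguistic meaning for tokens like `ae0.475`, which is exactly the situation described above.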

Which makes me wonder why you're trying to use NLP for non-natural language constructs.

answered Sep 22 '22 by trutheality