I have been trying to wrap my head around why this is happening and am hoping someone can shed some light on it. I am trying to tag the following text:
ae0.475 X mod
ae0.842 X mod
ae0.842 X mod
ae0.775 X mod
using the following code:
import nltk

file = open("test", "r")
for line in file:
    # split on spaces and drop any empty tokens
    words = line.strip().split(' ')
    words = [word.strip() for word in words if word != '']
    # tag the tokens and join the tags into a single key string
    tags = nltk.pos_tag(words)
    pos = [tags[x][1] for x in range(len(tags))]
    key = ' '.join(pos)
    print words, " : ", key
and am getting the following result:
['ae0.475', 'X', 'mod'] : NN NNP NN
['ae0.842', 'X', 'mod'] : -NONE- NNP NN
['ae0.842', 'X', 'mod'] : -NONE- NNP NN
['ae0.775', 'X', 'mod'] : NN NNP NN
And I don't get it. Does anyone know the reason for this inconsistency? I am not too particular about the accuracy of the POS tagging, since I am only trying to extract some templates, but the tagger seems to assign different tags at different instances to words that look "almost" the same.
As a workaround, I replaced all digits with 1 (a sketch of the substitution follows the output below), and that solved the problem:
['ae1.111', 'X', 'mod'] : NN NNP NN
['ae1.111', 'X', 'mod'] : NN NNP NN
['ae1.111', 'X', 'mod'] : NN NNP NN
['ae1.111', 'X', 'mod'] : NN NNP NN
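For reference, the substitution itself is just a digit-to-1 replacement, roughly like this (a minimal sketch, not my exact script):

import re

# Replace every digit with '1' so that numerically different tokens look identical to the tagger
def normalize(word):
    return re.sub(r'\d', '1', word)

print([normalize(w) for w in ['ae0.842', 'X', 'mod']])  # ['ae1.111', 'X', 'mod']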
That works, but I am still curious why the tagger assigned different tags in the first case. Any suggestions?
The main problem with POS tagging is ambiguity. In English, many common words have multiple meanings and therefore multiple possible POS tags. The job of a POS tagger is to resolve this ambiguity accurately based on the context of use. For example, the word "shot" can be a noun or a verb.
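For example (a small illustration, assuming nltk and its default tagger data are installed; exact tags can vary between NLTK versions):

import nltk
# Requires: nltk.download('punkt') and nltk.download('averaged_perceptron_tagger')
print(nltk.pos_tag(nltk.word_tokenize("She shot the ball into the net")))
# 'shot' is typically tagged as a past-tense verb (VBD) here
print(nltk.pos_tag(nltk.word_tokenize("That was a great shot")))
# 'shot' is typically tagged as a noun (NN) here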
POS-tagging algorithms fall into two distinctive groups: rule-based and stochastic. E. Brill's tagger, one of the first and most widely used English POS-taggers, employs rule-based algorithms.
Early in school, when we learned grammar, we were taught that words can be classified into various categories, such as noun, verb, and adjective. These categories help us understand the role a word plays in a sentence.
POS tags make it possible for automatic text processing tools to take into account which part of speech each word is. This facilitates the use of linguistic criteria in addition to statistics.
My best effort to understand this turned up the following, from someone training a tagger on only part of the Brown corpus:
Note that words that the tagger has not seen before, such as decried, receive a tag of None.
So I suspect that something looking like ae1.111 appears in the corpus data, but nothing like ae0.842 does. That's kind of weird, but it would explain the -NONE- tag.
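You can see that behavior directly with a simple lookup tagger trained on part of the Brown corpus (a sketch, assuming the Brown corpus is downloaded; nltk.pos_tag itself uses a different model, so this only illustrates the unseen-word effect):

import nltk
from nltk.corpus import brown  # requires nltk.download('brown')

# A unigram (lookup) tagger can only tag words it saw during training
tagger = nltk.UnigramTagger(brown.tagged_sents(categories='news'))
print(tagger.tag(['decried', 'the', 'ae0.842']))
# Unseen tokens such as 'decried' and 'ae0.842' come back with tag None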
Edit: I got super-curious, downloaded the Brown corpus myself, and plain-text-searched inside it. The number 111 appears in it 34 times, and the number 842 appears only 4 times. 842 appears only in the middle of dollar amounts or as the last 3 digits of a year, while 111 appears many times on its own as a page number. 775 also appears once as a page number.
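If you want to reproduce the search through nltk rather than grepping the raw files, something like this should be close (a sketch; exact counts depend on the corpus version and on how you match the digits):

from nltk.corpus import brown  # requires nltk.download('brown')

words = brown.words()
for digits in ('111', '842', '775'):
    # count tokens that contain the digit sequence anywhere
    count = sum(1 for w in words if digits in w)
    print(digits, count)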
So, I'm going to make a conjecture, that because of Benford's Law, you will end up matching numbers that start with 1s, 2s, and 3s much more often than numbers that start with 8s or 9s, since these are more often the page numbers of a random page that would be cited in a book. I'd be really interested in finding out if that's true (but not interested enough to do it myself, of course!).
It is "deterministic" in the sense that the same sentence is going to be tagged the same way using the same algorithm every time, but since your words aren't in nltk's data (in fact, aren't even real words in real sentences) it's going to use some algorithm to try to infer what the tags would be. That is going to mean that you can have different taggings when the words change (even if the change is a different number like you have) and that the taggings aren't going to make much sense anyway.
Which makes me wonder why you're trying to use NLP for non-natural language constructs.