Is POS tagging deterministic?

I have been trying to wrap my head around why this is happening, and I am hoping someone can shed some light on it. I am trying to tag the following text:

ae0.475      X  mod 
ae0.842      X  mod
ae0.842      X  mod 
ae0.775      X  mod 

using the following code:

import nltk

with open("test", "r") as f:
    for line in f:
        words = line.split()  # splits on any whitespace and drops empty strings
        tags = nltk.pos_tag(words)
        key = ' '.join(tag for _, tag in tags)
        print(words, " : ", key)

and am getting the following result:

['ae0.475', 'X', 'mod']  :  NN NNP NN
['ae0.842', 'X', 'mod']  :  -NONE- NNP NN
['ae0.842', 'X', 'mod']  :  -NONE- NNP NN
['ae0.775', 'X', 'mod']  :  NN NNP NN

And I don't get it. Does anyone know the reason for this inconsistency? I am not particularly concerned about the accuracy of the POS tagging, because I am only trying to extract some templates, but the tagger seems to assign different tags at different instances to words that look "almost" the same.

As a workaround, I replaced every digit with 1, which solved the problem:

['ae1.111', 'X', 'mod']  :  NN NNP NN
['ae1.111', 'X', 'mod']  :  NN NNP NN
['ae1.111', 'X', 'mod']  :  NN NNP NN
['ae1.111', 'X', 'mod']  :  NN NNP NN
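For reference, the digit replacement above can be done with a small regex pass before tagging (a minimal sketch; the function name is my own):

```python
import re

def normalize_numbers(token):
    """Replace every digit with '1' so that tokens differing only in
    their digits get identical, and therefore consistent, tags."""
    return re.sub(r'\d', '1', token)

words = ['ae0.475', 'X', 'mod']
print([normalize_numbers(w) for w in words])  # → ['ae1.111', 'X', 'mod']
```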

but I am curious why the tagger assigned different tags in the first case. Any suggestions?

asked Jun 30 '11 by Legend


2 Answers

The best explanation I could find comes from a note by someone who was not using the whole Brown corpus:

Note that words that the tagger has not seen before, such as decried, receive a tag of None.

So, I guess something that looks like ae1.111 must appear in the corpus file, while nothing like ae0.842 does. That's kind of weird, but it is the reasoning behind the -NONE- tag.

Edit: I got super-curious, downloaded the Brown corpus myself, and did a plain-text search inside it. The number 111 appears 34 times, while 842 appears only 4 times, and only in the middle of dollar amounts or as the last 3 digits of a year. 111 appears many times on its own as a page number, and 775 also appears once as a page number.

So, I'm going to make a conjecture, that because of Benford's Law, you will end up matching numbers that start with 1s, 2s, and 3s much more often than numbers that start with 8s or 9s, since these are more often the page numbers of a random page that would be cited in a book. I'd be really interested in finding out if that's true (but not interested enough to do it myself, of course!).
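The conjecture is easy to sanity-check in isolation: over the pages of a book cut off at an arbitrary length, leading 1s dominate leading 8s. A quick sketch (the 450-page cutoff is my own arbitrary choice, and this says nothing about the Brown corpus itself):

```python
from collections import Counter

# Leading digits of every page number in a hypothetical 450-page book.
# Pages starting with 1: 1, 10-19, 100-199 (111 pages).
# Pages starting with 8: 8, 80-89 (just 11 pages).
leading = Counter(str(page)[0] for page in range(1, 451))
print(leading['1'], leading['8'])  # → 111 11
```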

answered Sep 25 '22 by Chris Cunningham


It is "deterministic" in the sense that the same sentence will be tagged the same way by the same algorithm every time. But since your words aren't in nltk's data (in fact, they aren't even real words in real sentences), the tagger has to fall back on some algorithm to infer what the tags should be. That means the tagging can change when the words change (even if the only change is a different number, as in your case), and the resulting tags aren't going to make much sense anyway.
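To illustrate the distinction, here is a toy shape-based fallback tagger. This is emphatically not nltk's actual algorithm, just a sketch showing how any such heuristic is deterministic (the same token always gets the same tag) while still being sensitive to small changes in the token, such as a different digit shifting it onto a different rule:

```python
import re

# Ordered pattern -> tag rules, checked first match wins.
# Purely illustrative; real taggers use learned statistics.
RULES = [
    (r'^\d+\.\d+$', 'CD'),   # bare decimal number
    (r'.*\d.*',     'NN'),   # anything else containing a digit
    (r'^[A-Z]+$',   'NNP'),  # all-caps token
    (r'.*',         'NN'),   # default
]

def toy_tag(token):
    for pattern, tag in RULES:
        if re.match(pattern, token):
            return tag
    return 'NN'

print([toy_tag(w) for w in ['ae0.475', 'X', 'mod']])  # → ['NN', 'NNP', 'NN']
```

Calling `toy_tag` twice on the same token always yields the same tag, yet the tags carry no real linguistic meaning for tokens like `ae0.475`, which is exactly the situation described above.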

Which makes me wonder why you're trying to use NLP for non-natural language constructs.

answered Sep 22 '22 by trutheality