Python NLTK pos_tag not returning the correct part-of-speech tag

Tags:

python

machine-learning

nlp

nltk

pos-tagger

Having this:

text = word_tokenize("The quick brown fox jumps over the lazy dog")

And running:

nltk.pos_tag(text)

I get:

[('The', 'DT'), ('quick', 'NN'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'NNS'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'NN'), ('dog', 'NN')]

This is incorrect. The tags for quick, brown and lazy in the sentence should be:

('quick', 'JJ'), ('brown', 'JJ'), ('lazy', 'JJ')

Testing this through their online tool gives the same result; quick, brown and lazy should be adjectives, not nouns.

580

asked Jun 13 '15 16:06

faceoff

2 Answers

In short:

NLTK is not perfect. In fact, no model is perfect.

Note:

As of NLTK version 3.1, the default pos_tag function no longer uses the old MaxEnt English pickle.

It now uses the perceptron tagger from @Honnibal's implementation; see nltk.tag.pos_tag:

>>> import inspect
>>> from nltk import pos_tag
>>> print(inspect.getsource(pos_tag))
def pos_tag(tokens, tagset=None):
    tagger = PerceptronTagger()
    return _pos_tag(tokens, tagset, tagger)

It is better, but still not perfect:

>>> from nltk import pos_tag
>>> pos_tag("The quick brown fox jumps over the lazy dog".split())
[('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN')]

If you just want a TL;DR solution, see https://github.com/alvations/nltk_cli

In long:

Try using another tagger (see https://github.com/nltk/nltk/tree/develop/nltk/tag), e.g.:

HunPos
Stanford POS
Senna

Using the default MaxEnt POS tagger from NLTK (the default before version 3.1), i.e. nltk.pos_tag:

>>> from nltk import word_tokenize, pos_tag
>>> text = "The quick brown fox jumps over the lazy dog"
>>> pos_tag(word_tokenize(text))
[('The', 'DT'), ('quick', 'NN'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'NNS'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'NN'), ('dog', 'NN')]

Using the Stanford POS tagger:

$ cd ~
$ wget http://nlp.stanford.edu/software/stanford-postagger-2015-04-20.zip
$ unzip stanford-postagger-2015-04-20.zip
$ mv stanford-postagger-2015-04-20 stanford-postagger
$ python
>>> from os.path import expanduser
>>> home = expanduser("~")
>>> # Newer NLTK releases name this class StanfordPOSTagger.
>>> from nltk.tag.stanford import POSTagger
>>> _path_to_model = home + '/stanford-postagger/models/english-bidirectional-distsim.tagger'
>>> _path_to_jar = home + '/stanford-postagger/stanford-postagger.jar'
>>> st = POSTagger(path_to_model=_path_to_model, path_to_jar=_path_to_jar)
>>> text = "The quick brown fox jumps over the lazy dog"
>>> st.tag(text.split())
[(u'The', u'DT'), (u'quick', u'JJ'), (u'brown', u'JJ'), (u'fox', u'NN'), (u'jumps', u'VBZ'), (u'over', u'IN'), (u'the', u'DT'), (u'lazy', u'JJ'), (u'dog', u'NN')]

Using HunPOS (NOTE: the default encoding is ISO-8859-1, not UTF-8):

$ cd ~
$ wget https://hunpos.googlecode.com/files/hunpos-1.0-linux.tgz
$ tar zxvf hunpos-1.0-linux.tgz
$ wget https://hunpos.googlecode.com/files/en_wsj.model.gz
$ gzip -d en_wsj.model.gz
$ mv en_wsj.model hunpos-1.0-linux/
$ python
>>> from os.path import expanduser
>>> home = expanduser("~")
>>> from nltk.tag.hunpos import HunposTagger
>>> _path_to_bin = home + '/hunpos-1.0-linux/hunpos-tag'
>>> _path_to_model = home + '/hunpos-1.0-linux/en_wsj.model'
>>> ht = HunposTagger(path_to_model=_path_to_model, path_to_bin=_path_to_bin)
>>> text = "The quick brown fox jumps over the lazy dog"
>>> ht.tag(text.split())
[('The', 'DT'), ('quick', 'JJ'), ('brown', 'JJ'), ('fox', 'NN'), ('jumps', 'NNS'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN')]
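Because of the ISO-8859-1 note above, it can help to check your tokens before handing them to HunPOS. The following is only a minimal stdlib sketch; the helper name latin1_problems is hypothetical and not part of NLTK or HunPOS.

```python
def latin1_problems(tokens):
    """Return the tokens that cannot be encoded as ISO-8859-1."""
    bad = []
    for tok in tokens:
        try:
            tok.encode("iso-8859-1")
        except UnicodeEncodeError:
            # Characters outside Latin-1 (e.g. the euro sign) end up here.
            bad.append(tok)
    return bad


tokens = "The naïve fox paid 5 € for a café".split()
print(latin1_problems(tokens))  # ['€']
```

Tokens reported by this check would need to be replaced or transliterated before tagging; how best to do that depends on your data.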

Using Senna (make sure you have the latest version of NLTK; some changes were made to the API):

$ cd ~
$ wget http://ronan.collobert.com/senna/senna-v3.0.tgz
$ tar zxvf senna-v3.0.tgz
$ python
>>> from os.path import expanduser
>>> home = expanduser("~")
>>> from nltk.tag.senna import SennaTagger
>>> st = SennaTagger(home+'/senna')
>>> text = "The quick brown fox jumps over the lazy dog"
>>> st.tag(text.split())
[('The', u'DT'), ('quick', u'JJ'), ('brown', u'JJ'), ('fox', u'NN'), ('jumps', u'VBZ'), ('over', u'IN'), ('the', u'DT'), ('lazy', u'JJ'), ('dog', u'NN')]
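To put the outputs above side by side, here is a small stdlib-only sketch that scores each tagger against a reference tagging. The tag lists are copied from the examples in this answer. The GOLD list is an assumption: the question only says that quick, brown and lazy should be JJ, and VBZ for jumps follows the Stanford and Senna outputs. The dictionary keys are just labels.

```python
TOKENS = "The quick brown fox jumps over the lazy dog".split()

# Assumed reference tags (not given in full by the original question).
GOLD = ["DT", "JJ", "JJ", "NN", "VBZ", "IN", "DT", "JJ", "NN"]

# Tagger outputs copied from the examples above.
OUTPUTS = {
    "maxent (old default)": ["DT", "NN", "NN", "NN", "NNS", "IN", "DT", "NN", "NN"],
    "perceptron (3.1+)": ["DT", "JJ", "NN", "NN", "VBZ", "IN", "DT", "JJ", "NN"],
    "stanford": ["DT", "JJ", "JJ", "NN", "VBZ", "IN", "DT", "JJ", "NN"],
    "hunpos": ["DT", "JJ", "JJ", "NN", "NNS", "IN", "DT", "JJ", "NN"],
    "senna": ["DT", "JJ", "JJ", "NN", "VBZ", "IN", "DT", "JJ", "NN"],
}


def mismatches(predicted, gold=GOLD, tokens=TOKENS):
    """Return (token, predicted_tag, gold_tag) for every disagreement."""
    return [(tok, p, g) for tok, p, g in zip(tokens, predicted, gold) if p != g]


def accuracy(predicted, gold=GOLD):
    """Fraction of tokens whose predicted tag equals the reference tag."""
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


for name, tags in OUTPUTS.items():
    print(f"{name:22s} {accuracy(tags):.2f} {mismatches(tags)}")
```

One sentence is far too little to rank taggers; the same two functions work unchanged on a larger hand-checked sample.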

Or try building a better POS tagger:

Ngram Tagger: http://streamhacker.com/2008/11/03/part-of-speech-tagging-with-nltk-part-1/
Affix/Regex Tagger: http://streamhacker.com/2008/11/10/part-of-speech-tagging-with-nltk-part-2/
Build Your Own Brill (read the code, it's a pretty fun tagger: http://www.nltk.org/_modules/nltk/tag/brill.html), see http://streamhacker.com/2008/12/03/part-of-speech-tagging-with-nltk-part-3/
Perceptron Tagger: https://honnibal.wordpress.com/2013/09/11/a-good-part-of-speechpos-tagger-in-about-200-lines-of-python/
LDA Tagger: http://scm.io/blog/hack/2015/02/lda-intentions/
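The n-gram and affix/regex approaches in the list above are usually combined as a backoff chain: try the most specific tagger first and fall through to a cruder one when it has no answer. The sketch below shows that idea with the standard library only. It is not NLTK's implementation (NLTK ships its own DefaultTagger, RegexpTagger and UnigramTagger, which accept a backoff argument), and the training sentences and patterns are made up for illustration.

```python
import re
from collections import Counter, defaultdict


class DefaultTagger:
    """Last resort: give every word the same tag."""

    def __init__(self, tag):
        self.default = tag

    def tag_word(self, word):
        return self.default


class RegexTagger:
    """Tag by the first matching pattern, else ask the backoff tagger."""

    def __init__(self, patterns, backoff):
        self.patterns = [(re.compile(p), t) for p, t in patterns]
        self.backoff = backoff

    def tag_word(self, word):
        for pattern, tag in self.patterns:
            if pattern.search(word):
                return tag
        return self.backoff.tag_word(word)


class UnigramTagger:
    """Tag with the most frequent tag seen in training, else back off."""

    def __init__(self, tagged_sents, backoff):
        counts = defaultdict(Counter)
        for sent in tagged_sents:
            for word, tag in sent:
                counts[word.lower()][tag] += 1
        self.best = {w: c.most_common(1)[0][0] for w, c in counts.items()}
        self.backoff = backoff

    def tag_word(self, word):
        return self.best.get(word.lower()) or self.backoff.tag_word(word)


def tag(tagger, tokens):
    return [(tok, tagger.tag_word(tok)) for tok in tokens]


# Tiny hand-labelled training set (hypothetical, for illustration only).
TRAIN = [
    [("The", "DT"), ("quick", "JJ"), ("cat", "NN"), ("sleeps", "VBZ")],
    [("A", "DT"), ("lazy", "JJ"), ("dog", "NN"), ("barks", "VBZ")],
]
PATTERNS = [(r"ly$", "RB"), (r"ing$", "VBG"), (r"ed$", "VBD")]

chain = UnigramTagger(TRAIN, RegexTagger(PATTERNS, DefaultTagger("NN")))
print(tag(chain, "The quick dog barked loudly".split()))
# [('The', 'DT'), ('quick', 'JJ'), ('dog', 'NN'), ('barked', 'VBD'), ('loudly', 'RB')]
```

Known words come from the unigram table, unseen words with a recognisable suffix come from the regex layer, and everything else falls back to NN.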

Complaints about pos_tag accuracy on Stack Overflow include:

POS tagging - NLTK thinks noun is adjective
python NLTK POS tagger not behaving as expected
How to obtain better results using NLTK pos tag
pos_tag in NLTK does not tag sentences correctly

Issues about NLTK HunPos include:

How do I tag textfiles with hunpos in nltk?
Does anyone know how to configure the hunpos wrapper class on nltk?

Issues with NLTK and Stanford POS tagger include:

trouble importing stanford pos tagger into nltk
Java Command Fails in NLTK Stanford POS Tagger
Error using Stanford POS Tagger in NLTK Python
How to improve speed with Stanford NLP Tagger and NLTK
Nltk stanford pos tagger error : Java command failed
Instantiating and using StanfordTagger within NLTK
Running Stanford POS tagger in NLTK leads to "not a valid Win32 application" on Windows

129

answered Sep 24 '22 23:09

alvas

Solutions such as changing to the Stanford or Senna or HunPOS tagger will definitely yield results, but here is a much simpler way to experiment with different taggers that are also included within NLTK.

The default POS tagger in NLTK right now is the averaged perceptron tagger. Here's a function that will opt to use the Maxent Treebank Tagger instead:

import nltk

def treebankTag(text):
    # Tokenize, then tag with the MaxEnt Treebank model instead of the default tagger.
    words = nltk.word_tokenize(text)
    treebankTagger = nltk.data.load('taggers/maxent_treebank_pos_tagger/english.pickle')
    return treebankTagger.tag(words)

I have found that the pre-trained averaged perceptron tagger in NLTK is biased toward treating some adjectives as nouns, as in your example. The treebank tagger has gotten more adjectives correct for me.
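When experimenting with several taggers like this, it helps to see only the tokens they disagree on. Here is a minimal sketch that needs no reference tags. The function name disagreements and the replay helper are hypothetical; the stand-in taggers simply replay two outputs quoted in this thread so the code runs without model files. In practice you would pass real callables such as nltk.pos_tag or a tagger's tag method.

```python
def disagreements(tokens, taggers):
    """Return {(index, token): {tagger_name: tag}} where the taggers differ.

    `taggers` maps a name to any callable that takes a token list and
    returns a list of (token, tag) pairs.
    """
    results = {name: [t for _, t in fn(tokens)] for name, fn in taggers.items()}
    diffs = {}
    for i, tok in enumerate(tokens):
        tags = {name: tag_list[i] for name, tag_list in results.items()}
        if len(set(tags.values())) > 1:
            diffs[(i, tok)] = tags
    return diffs


def replay(tags):
    """Build a fake tagger that pairs tokens with a fixed tag list."""
    return lambda tokens: list(zip(tokens, tags))


tokens = "The quick brown fox jumps over the lazy dog".split()
taggers = {
    "maxent": replay(["DT", "NN", "NN", "NN", "NNS", "IN", "DT", "NN", "NN"]),
    "perceptron": replay(["DT", "JJ", "NN", "NN", "VBZ", "IN", "DT", "JJ", "NN"]),
}
for key, tags in disagreements(tokens, taggers).items():
    print(key, tags)
```

A disagreement only tells you where to look; deciding which tagger is right still needs a human check or a labelled sample.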

answered Sep 25 '22 23:09

Hockenmaier
