
How does nltk.pos_tag() work?

Tags:

python

nlp

nltk

How does nltk.pos_tag() work? Does it use any corpus? I found the source code (nltk.tag - NLTK 3.0 documentation), which says

_POS_TAGGER = 'taggers/maxent_treebank_pos_tagger/english.pickle'.

Loading _POS_TAGGER gives an object of type

nltk.tag.sequential.ClassifierBasedPOSTagger,

which seems to have had no training on a corpus. The tagging is incorrect when I use a few adjectives in series before a noun (e.g. "the quick brown fox"). Can I improve the result by using a better tagging method, or by training on a better corpus? Any suggestions?

Nabi Hayang asked Aug 14 '15 18:08

2 Answers

According to the source code, pos_tag uses NLTK's currently recommended POS tagger, which is PerceptronTagger as of 2018.

Here is the documentation for PerceptronTagger and here is the source code.

To use the tagger you can simply call pos_tag(tokens). This will call PerceptronTagger's default constructor, which uses a "pretrained" model. This is a pickled model that NLTK distributes, file located at: taggers/averaged_perceptron_tagger/averaged_perceptron_tagger.pickle. This is trained and tested on the Wall Street Journal corpus.

Alternatively, you can instantiate a PerceptronTagger and train its model yourself by providing tagged examples, e.g.:

from nltk.tag.perceptron import PerceptronTagger

tagger = PerceptronTagger(load=False)  # don't load the pretrained model
tagger.train([[('today', 'NN'), ('is', 'VBZ'), ('good', 'JJ'), ('day', 'NN')],
              [('yes', 'NNS'), ('it', 'PRP'), ('beautiful', 'JJ')]])

The documentation links to this blog post which does a good job of describing the theory.

TL;DR: PerceptronTagger is a greedy averaged perceptron tagger. This basically means that it has a dictionary of weights associated with features, which it uses to predict the correct tag for a given set of features. During training, the tagger guesses a tag and adjusts weights according to whether or not the guess was correct. "Averaged" means that the final weight for each feature is the average of its values across all training updates, which makes the tagger less sensitive to the last examples it happened to see.
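The update rule described above can be sketched in plain Python. This is a toy illustration of the averaged-perceptron idea, not NLTK's actual implementation; the feature strings and class names here are made up:

```python
from collections import defaultdict

class ToyAveragedPerceptron:
    """Toy averaged perceptron: one weight per (feature, tag) pair,
    with weights averaged over all training steps at the end."""

    def __init__(self, tags):
        self.tags = tags
        self.weights = defaultdict(float)     # (feature, tag) -> current weight
        self.totals = defaultdict(float)      # (feature, tag) -> accumulated weight
        self.timestamps = defaultdict(int)    # step at which a weight last changed
        self.step = 0

    def predict(self, features):
        # Greedy decision: score every tag, pick the highest
        scores = {t: sum(self.weights[(f, t)] for f in features) for t in self.tags}
        return max(self.tags, key=lambda t: scores[t])

    def update(self, truth, guess, features):
        self.step += 1
        if truth == guess:
            return  # correct guess: no weight change
        for f in features:
            # reward the true tag, penalize the wrong guess
            for tag, delta in ((truth, 1.0), (guess, -1.0)):
                key = (f, tag)
                # credit the old weight for every step it was in effect
                self.totals[key] += (self.step - self.timestamps[key]) * self.weights[key]
                self.timestamps[key] = self.step
                self.weights[key] += delta

    def average(self):
        # Replace each weight by its average over all steps
        for key, w in self.weights.items():
            self.totals[key] += (self.step - self.timestamps[key]) * w
            self.weights[key] = self.totals[key] / self.step
```

A usage sketch: extract features for each token (word identity, suffix, neighboring words), call `predict`, then `update` with the gold tag, and finally `average` once training is done.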

user812786 answered Oct 05 '22 06:10


The tagger is a machine-learning tagger that has been trained and saved for you. No tagger is perfect, but if you want optimal performance you shouldn't try to roll your own. Look around for state-of-the-art taggers that are free to download and use, such as the Stanford tagger, for which NLTK provides an interface.

For the Stanford tagger in particular, see help(nltk.tag.stanford). You'll need to download the Stanford tools yourself from http://nlp.stanford.edu/software/.
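The NLTK wrapper can be used roughly like this. It's a sketch: the paths are placeholders for wherever you unpacked the Stanford tools, and the helper function name is made up here:

```python
from nltk.tag.stanford import StanfordPOSTagger

def tag_with_stanford(tokens, model_path, jar_path):
    """Tag a list of tokens with the Stanford POS tagger.
    Requires a local download of the Stanford tools (jar + model file)."""
    st = StanfordPOSTagger(model_path, path_to_jar=jar_path)
    return st.tag(tokens)

# Example call (paths are placeholders for your own download):
# tag_with_stanford('The quick brown fox'.split(),
#                   '/path/to/models/english-bidirectional-distsim.tagger',
#                   '/path/to/stanford-postagger.jar')
```

Note that the wrapper shells out to Java, so a Java runtime must also be installed.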

alexis answered Oct 05 '22 05:10