The nltk package's built-in part-of-speech tagger does not seem to be optimized for my use case (here, for instance). The source code here shows that it's using a saved, pre-trained classifier called maxent_treebank_pos_tagger.
What created maxent_treebank_pos_tagger/english.pickle? I'm guessing that there is a tagged corpus out there somewhere that was used to train this tagger, so I'm looking for (a) that tagged corpus and (b) the exact code that trains the tagger on it.
In addition to lots of googling, I have so far tried looking at the .pickle object directly to find any clues inside it, starting like this:
from nltk.data import load

# Unpickle the shipped tagger and list its attributes for clues.
# (The same resource is normally addressed relative to the NLTK data path,
# i.e. "taggers/maxent_treebank_pos_tagger/english.pickle".)
x = load("nltk_data/taggers/maxent_treebank_pos_tagger/english.pickle")
dir(x)
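For what it's worth, a slightly deeper inspection is possible, assuming the pickle unpickles to NLTK's ClassifierBasedPOSTagger wrapping a MaxentClassifier (which is what the tag module source linked below suggests); the method names here come from those classes, not from anything stored specially in the pickle:

from nltk.data import load

# Resource path relative to the NLTK data directory; adjust if your layout differs.
tagger = load("taggers/maxent_treebank_pos_tagger/english.pickle")

print(type(tagger))                           # expect a ClassifierBasedPOSTagger
print(tagger.tag("This is a test".split()))   # tag a toy sentence

# ClassifierBasedTagger exposes the trained classifier it wraps.
clf = tagger.classifier()
print(type(clf))             # expect a MaxentClassifier
print(sorted(clf.labels()))  # the Penn Treebank tagset it predicts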
The NLTK source is https://github.com/nltk/nltk/blob/develop/nltk/tag/__init__.py#L83
The original source of NLTK's MaxEnt POS tagger is https://github.com/arne-cl/nltk-maxent-pos-tagger (a training sketch follows below).
Training data: the Wall Street Journal subset of the Penn Treebank corpus
Features: Ratnaparkhi (1996)
Algorithm: Maximum Entropy
Accuracy: see the related question "What is the accuracy of nltk pos_tagger?"
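Since the question also asks for the training code: the exact script that produced english.pickle isn't bundled with NLTK, but you can train and pickle a comparable classifier-based maxent tagger from the Penn Treebank sample that ships with nltk_data along these lines. This is a minimal sketch, not the original pipeline: it uses the small treebank sample rather than the full (licensed) WSJ corpus, and the feature detector built into ClassifierBasedPOSTagger rather than arne-cl's Ratnaparkhi (1996) feature extractor.

import pickle

from nltk.corpus import treebank
from nltk.classify import MaxentClassifier
from nltk.tag.sequential import ClassifierBasedPOSTagger

# The ~3,900-sentence treebank sample shipped with nltk_data
# (the shipped english.pickle would have been trained on the full WSJ portion).
train_sents = treebank.tagged_sents()

# Build a maxent classifier over the tagger's built-in contextual features.
# IIS is pure Python and slow; 'megam' is faster if the external binary is installed.
def build_maxent(train_feats):
    return MaxentClassifier.train(train_feats, algorithm="iis",
                                  trace=0, max_iter=10)

tagger = ClassifierBasedPOSTagger(train=train_sents,
                                  classifier_builder=build_maxent)

print(tagger.tag("This is a test".split()))

# Persist the trained tagger as a pickle, analogous to the shipped english.pickle.
with open("my_maxent_treebank_pos_tagger.pickle", "wb") as f:
    pickle.dump(tagger, f)

Training with the pure-Python IIS/GIS optimizers takes a while; the arne-cl repository linked above follows the same idea but with its own Ratnaparkhi-style feature extractor, so it is probably the closer match to what produced the shipped model.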