I'm looking for a good open source POS Tagger in Java. Here's what I have come up with so far.
Anybody got any recommendations?
Tokenization and part-of-speech (POS) tagging in Python's NLTK library. NLTK features a robust sentence tokenizer and POS tagger.
Summary. POS tagging in NLTK is the process of marking up each word in a text with a part of speech, based on its definition and context. Some examples of NLTK POS tags are: CC, CD, EX, JJ, MD, NNP, PDT, PRP$, TO, etc. A POS tagger assigns grammatical information to each word of a sentence.
VBG: verb, present participle or gerund (e.g. stirring, focusing, approaching, erasing).
VBN: verb, past participle.
TaggerI (base class): the base class of NLTK's taggers is TaggerI, meaning that all the taggers inherit from this class.
Are you looking to tag POS in a specific domain? Most general-purpose taggers are trained on newswire text, and they typically don't perform well in specific domains (such as biomedical text). There are other taggers trained specifically for such domains, such as dTagger (Java) for biomedical text.
For newswire text, Adwait Ratnaparkhi's MXPOST is very good and is the one I would recommend.
Other Java implementations include:
OpenNLP and LingPipe, as mentioned by other posters, are also pretty decent.
Info on the state of the art in POS tagging can be found here. As you can see, LTAG-Spinal (also mentioned by another poster) ranks best as of now, but the differences among the various taggers are small. I have not used LTAG myself.
Also note that the baseline performance for POS tagging is about 90%. Baseline here means: (a) tag every known word with its most frequent POS tag from a lexicon, and (b) tag every unknown word as a noun.
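To make that baseline concrete, here is a minimal Java sketch of it. The class name `BaselineTagger`, its method names, and the use of the Penn Treebank tag `NN` for "noun" are my own assumptions for illustration; none of this comes from any of the libraries mentioned in this thread, and the lexicon is simply built from whatever (word, tag) pairs you feed it.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical most-frequent-tag baseline tagger; not taken from any library. */
class BaselineTagger {
    /** Assumption: unknown words get the Penn Treebank common-noun tag. */
    static final String UNKNOWN_WORD_TAG = "NN";

    // word -> (tag -> number of times the word was seen with that tag)
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();

    /** Record one (word, tag) observation from a tagged corpus. */
    void observe(String word, String tag) {
        counts.computeIfAbsent(word, k -> new HashMap<>()).merge(tag, 1, Integer::sum);
    }

    /** Rule (a): most frequent tag for a known word. Rule (b): noun for an unknown word. */
    String tag(String word) {
        Map<String, Integer> tagCounts = counts.get(word);
        if (tagCounts == null) {
            return UNKNOWN_WORD_TAG;
        }
        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, Integer> e : tagCounts.entrySet()) {
            int c = e.getValue();
            // Ties are broken alphabetically by tag so the result is deterministic.
            if (c > bestCount || (c == bestCount && e.getKey().compareTo(best) < 0)) {
                best = e.getKey();
                bestCount = c;
            }
        }
        return best;
    }

    /** Tag each token independently; the baseline ignores context entirely. */
    List<String> tagAll(List<String> words) {
        List<String> tags = new ArrayList<>();
        for (String w : words) {
            tags.add(tag(w));
        }
        return tags;
    }
}
```

You would call `observe` once per token of a tagged training corpus and then `tagAll` on a tokenized sentence. Because every word is tagged in isolation, an ambiguous word always gets the same tag regardless of its neighbours, and that gap is what the real taggers are competing to close.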
I have used OpenNLP with good results. You can also check out MorphAdorner.