I want to use NLTK to POS-tag German texts. I found some references on the web, but most of them are outdated. Some refer, for example, to a "EUROPARL" corpus, but it looks like only "EUROPARL_raw" is still available, and that one is not POS tagged. I also found some references to using the TIGER corpus, but the latest version seems to be in a format I cannot parse with NLTK out of the box.
I'm aware of some non-NLTK alternatives, but I would prefer to use NLTK. Could somebody provide a simple example of POS tagging based on a German corpus?
I was unable to find a tagged German corpus that ships with NLTK. If you require a pre-tagged corpus, you may be out of luck with NLTK. There is an open issue ticket for exactly this, but there has been no progress (Reading Negra Corpus Files).
You could train your own tagger using NLTK-Trainer and the Negra corpus. It would require knowledge of German grammar but no coding. See the demonstration of NLTK-Trainer.
Using the TIGER corpus to train a tagger is a good approach. It is now also available in CoNLL09 format, which NLTK can load. I used it in combination with Philipp Nolte's ClassifierBasedGermanTagger and got ~96% accuracy. I wrote a blog post on POS tagging of German texts with NLTK that explains how to get this running.
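To give a rough idea of what this looks like in code, here is a minimal sketch. It does not use ClassifierBasedGermanTagger. Instead it swaps in NLTK's built-in n-gram taggers with backoff, so expect lower accuracy than the ~96% mentioned above. The file name `tiger.conll09`, the helper names, and the assumption that the word form is in column 2 and the POS tag in column 5 of the CoNLL09 file are mine, so adjust them to your download.

```python
from nltk.corpus.reader import ConllCorpusReader
from nltk.tag import BigramTagger, DefaultTagger, UnigramTagger


def load_tagged_sents(root, fileid):
    """Read (word, tag) sentences from a CoNLL09 file.

    Assumed column layout: ID, FORM, LEMMA, PLEMMA, POS, ...
    Only the FORM and POS columns are read; the rest are ignored.
    """
    reader = ConllCorpusReader(
        root,
        fileid,
        ["ignore", "words", "ignore", "ignore", "pos"],
        encoding="utf-8",
    )
    return list(reader.tagged_sents())


def train_backoff_tagger(train_sents, default_tag="NN"):
    """Bigram tagger that backs off to unigram, then to a default tag."""
    default = DefaultTagger(default_tag)
    unigram = UnigramTagger(train_sents, backoff=default)
    return BigramTagger(train_sents, backoff=unigram)


def accuracy(tagger, gold_sents):
    """Fraction of tokens whose predicted tag matches the gold tag."""
    correct = total = 0
    for sent in gold_sents:
        words = [word for word, _ in sent]
        for (_, predicted), (_, gold) in zip(tagger.tag(words), sent):
            correct += predicted == gold
            total += 1
    return correct / total if total else 0.0


# Example usage (hypothetical path; requires the TIGER corpus file):
#   sents = load_tagged_sents(".", "tiger.conll09")
#   split = int(len(sents) * 0.9)
#   tagger = train_backoff_tagger(sents[:split])
#   print(accuracy(tagger, sents[split:]))
#   print(tagger.tag(["Das", "ist", "ein", "Test"]))
```

Unknown words fall through to the default tag `NN`, which is a crude but common choice since nouns are the largest open class. A classifier-based tagger that looks at suffixes and capitalization handles unknown words much better, which is presumably where most of the accuracy gap comes from.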