POS tagging German texts using NLTK

Question

I want to use NLTK to POS tag German texts. I found some references on the web, but most of them are outdated. Some refer, for example, to a "EUROPARL" corpus, but it looks like only "EUROPARL_raw" is still available, and that one is not POS tagged. I also found some references to using the TIGER corpus, but the latest version seems to be in a format I cannot parse with NLTK out of the box.

I'm aware of some non-NLTK alternatives, but I would prefer to use NLTK. Could somebody provide a simple example of POS tagging based on a German corpus?

BigHandsome · Accepted Answer

I was unable to find a tagged German corpus that works with NLTK out of the box. If you require a pre-tagged corpus, you may be out of luck with NLTK. There is an open ticket for this very issue (Reading Negra Corpus Files), but there has been no progress.

You could tag your own corpus using NLTK-Trainer and the Negra Corpus. It would require knowledge of German grammar but no coding. See the demonstration of NLTK-Trainer.

IsaacKleiner · Answer

Using the TIGER corpus for training a tagger is a good approach. It's now also available in CONLL09 format, which can be loaded with NLTK. I used it in combination with Philipp Nolte's ClassifierBasedGermanTagger and got ~96% accuracy. I wrote a blog post on POS tagging of German texts with NLTK that explains how to get this running.
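As a rough illustration of the idea, here is a minimal sketch of turning CoNLL-2009 lines into the list-of-sentences shape that NLTK taggers train on. It assumes the standard CoNLL-2009 column layout (word form in column 2, gold POS tag in column 5, tab-separated, blank line between sentences). The helper names and the tiny inline sample are made up for this example. Note also that it trains a plain n-gram backoff tagger from NLTK, not the ClassifierBasedGermanTagger mentioned above, so it should not be expected to reach the same accuracy.

```python
from typing import Iterable, List, Tuple

TaggedSentence = List[Tuple[str, str]]


def read_conll09_tagged_sents(lines: Iterable[str]) -> List[TaggedSentence]:
    """Collect (word, POS) pairs per sentence from CoNLL-2009 style lines.

    Assumes tab-separated columns with FORM at index 1 and POS at index 4,
    and a blank line between sentences.
    """
    sentences: List[TaggedSentence] = []
    current: TaggedSentence = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        columns = line.split("\t")
        if len(columns) < 5:
            continue  # skip lines that do not look like token rows
        current.append((columns[1], columns[4]))
    if current:
        sentences.append(current)
    return sentences


def train_backoff_tagger(tagged_sents: List[TaggedSentence]):
    """Train a bigram -> unigram -> default backoff tagger with NLTK."""
    import nltk  # imported here so the parsing helper works without NLTK

    # "NN" (common noun in the STTS tagset) is the fallback for unseen words.
    default = nltk.DefaultTagger("NN")
    unigram = nltk.UnigramTagger(tagged_sents, backoff=default)
    return nltk.BigramTagger(tagged_sents, backoff=unigram)


# Hypothetical two-sentence sample, shortened to the first six columns.
SAMPLE = [
    "1\tDas\tder\t_\tART\t_",
    "2\tHaus\tHaus\t_\tNN\t_",
    "3\tist\tsein\t_\tVAFIN\t_",
    "4\talt\talt\t_\tADJD\t_",
    "",
    "1\tEr\ter\t_\tPPER\t_",
    "2\tliest\tlesen\t_\tVVFIN\t_",
]

tagged = read_conll09_tagged_sents(SAMPLE)
```

With the real corpus, you would pass an open file handle (opened with `encoding="utf-8"`) to `read_conll09_tagged_sents` instead of `SAMPLE`, hold out part of the sentences for evaluation, and then call something like `train_backoff_tagger(train_sents).tag(["Das", "Haus", "ist", "alt"])`.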

Tags: python, nltk, pos-tagger

Achim
