
POS tagging German texts using NLTK

I want to use NLTK to POS-tag German texts. I found some references on the web, but most of them are outdated. Some, for example, refer to a "EUROPARL" thesaurus, but it looks like only "EUROPARL_raw" is still available, and that one is not POS tagged. I also found some references to the TIGER corpus, but the latest version seems to be in a format I cannot parse with NLTK out of the box.

I'm aware of some non-NLTK alternatives, but I would prefer to use NLTK. Could somebody provide a simple example of POS tagging based on a German corpus?

Achim asked Dec 02 '13 16:12


2 Answers

I was unable to find a tagged German corpus that works with NLTK out of the box. If you require a pre-tagged corpus, you may be out of luck with NLTK. There is an open ticket for this very problem ("Reading Negra Corpus Files"), but there has been no progress on it.

You could tag your own corpus using NLTK-Trainer and the Negra corpus. This would require knowledge of German grammar but no coding. See the demonstration of NLTK-Trainer.

BigHandsome answered Oct 24 '22 10:10


Using the TIGER corpus to train a tagger is a good approach. It is now also available in CoNLL09 format, which can be loaded with NLTK. I used it in combination with Philipp Nolte's ClassifierBasedGermanTagger and got ~96% accuracy. I wrote a blog post on POS tagging of German texts with NLTK that explains how to get this running.
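As a rough sketch of what loading CoNLL09 data and training a tagger can look like: the snippet below reads a CoNLL09-style file with NLTK's `ConllCorpusReader`, mapping the second column to words and the fifth to POS tags. Several things here are assumptions for illustration only. The file `toy.conll09` and its two sentences are made-up stand-ins, not real TIGER data, and the rows are truncated to six columns. It also swaps in NLTK's built-in `UnigramTagger` with a `DefaultTagger` backoff rather than the ClassifierBasedGermanTagger mentioned above, since that one is third-party code and not part of NLTK; expect much lower accuracy from this simpler tagger.

```python
import os
import tempfile

from nltk.corpus.reader import ConllCorpusReader
from nltk.tag import DefaultTagger, UnigramTagger

# Toy stand-in for the TIGER CoNLL09 file (hypothetical data, NOT real TIGER
# content). Columns used here: ID, FORM, LEMMA, PLEMMA, POS, PPOS.
# Sentences are separated by a blank line.
TOY_CONLL09 = (
    "1\tDas\t_\t_\tPDS\t_\n"
    "2\tist\t_\t_\tVAFIN\t_\n"
    "3\tein\t_\t_\tART\t_\n"
    "4\tHaus\t_\t_\tNN\t_\n"
    "5\t.\t_\t_\t$.\t_\n"
    "\n"
    "1\tDer\t_\t_\tART\t_\n"
    "2\tHund\t_\t_\tNN\t_\n"
    "3\tbellt\t_\t_\tVVFIN\t_\n"
    "4\t.\t_\t_\t$.\t_\n"
)

# Only FORM (column 2) and POS (column 5) are read; trailing columns are unused.
COLUMN_TYPES = ["ignore", "words", "ignore", "ignore", "pos"]


def load_tagged_sents(root, fileid):
    """Return all sentences as lists of (word, tag) tuples."""
    reader = ConllCorpusReader(root, fileid, COLUMN_TYPES, encoding="utf-8")
    # Materialize the lazy corpus view while the file still exists.
    return list(reader.tagged_sents())


with tempfile.TemporaryDirectory() as root:
    with open(os.path.join(root, "toy.conll09"), "w", encoding="utf-8") as f:
        f.write(TOY_CONLL09)
    train_sents = load_tagged_sents(root, "toy.conll09")

# Words never seen in training fall back to the default tag "NN".
tagger = UnigramTagger(train_sents, backoff=DefaultTagger("NN"))
tagged = tagger.tag(["Das", "Haus", "ist", "alt", "."])
print(tagged)
```

To try this on the real corpus, point `load_tagged_sents` at the directory and file name of your downloaded TIGER CoNLL09 release, shuffle the sentences, and hold out a portion (for example 10%) to measure accuracy before relying on the tagger.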

IsaacKleiner answered Oct 24 '22 11:10