POS tagging in German

Tags: python, nlp, nltk

I am using NLTK to extract nouns from a text-string starting with the following command:

tagged_text = nltk.pos_tag(nltk.Text(nltk.word_tokenize(some_string)))

It works fine in English. Is there an easy way to make it work for German as well?

(I have no experience with natural language programming, but I managed to use the python nltk library which is great so far.)

268

asked Oct 28 '09 20:10

Johannes Meier

5 Answers

Natural language software does its magic by leveraging corpora and the statistics they provide. You'll need to point NLTK at a German corpus so that it can tokenize and tag German correctly. I believe the EUROPARL corpus might help get you going.

See nltk.corpus.europarl_raw and this answer for example configuration.

Also, consider tagging this question with "nlp".

187

answered Oct 03 '22 13:10

Mike Atlas

The Pattern library includes a function for parsing German sentences and the result includes the part-of-speech tags. The following is copied from their documentation:

from pattern.de import parse, split
s = parse('Die Katze liegt auf der Matte.')
s = split(s)
print(s.sentences[0])

# Output:
# Sentence('Die/DT/B-NP/O Katze/NN/I-NP/O liegt/VB/B-VP/O '
#          'auf/IN/B-PP/B-PNP der/DT/B-NP/I-PNP Matte/NN/I-NP/I-PNP ././O/O')
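Since the original question is about extracting nouns, here is a minimal sketch of how the nouns could be pulled out of a tagged string in the word/POS/chunk/PNP format shown in the output above. It uses only the standard library and works on a hard-coded copy of that string; the helper name `nouns_from_tagged` is hypothetical and not part of Pattern.

```python
# Tagged string in the slash-delimited format shown above (word/POS/chunk/PNP).
tagged = ("Die/DT/B-NP/O Katze/NN/I-NP/O liegt/VB/B-VP/O "
          "auf/IN/B-PP/B-PNP der/DT/B-NP/I-PNP Matte/NN/I-NP/I-PNP ././O/O")

def nouns_from_tagged(tagged_string):
    """Return the words whose POS tag starts with 'NN' (hypothetical helper)."""
    nouns = []
    for token in tagged_string.split():
        # Split from the right so the last three fields are always the tags.
        word, pos, chunk, pnp = token.rsplit("/", 3)
        if pos.startswith("NN"):
            nouns.append(word)
    return nouns

print(nouns_from_tagged(tagged))
```

Matching on the `NN` prefix is an assumption about which tags count as nouns; adjust it if your tag set marks nouns differently.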

Update: Another option is spaCy; there is a quick example in this blog article:

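import spacy

# spaCy 3 removed the 'de' shortcut; load a German pipeline by its full name
# (install it first with: python -m spacy download de_core_news_sm).
nlp = spacy.load('de_core_news_sm')
doc = nlp('Ich bin ein Berliner.')

# show universal pos tags
print(' '.join('{word}/{tag}'.format(word=t.orth_, tag=t.pos_) for t in doc))
# output (may vary with the model version): Ich/PRON bin/AUX ein/DET Berliner/NOUN ./PUNCT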

answered Oct 03 '22 14:10

Suzana

Part-of-speech (POS) tagging is very specific to a particular [natural] language. NLTK includes many different taggers, which use distinct techniques to infer the tag of a given token in a given context. Most (but not all) of these taggers use a statistical model of some sort as the main or sole device to "do the trick". Such taggers require some "training data" upon which to build this statistical representation of the language, and the training data comes in the form of corpora.

The NLTK "distribution" itself includes many of these corpora, as well as a set of "corpus readers" which provide an API to read different types of corpora. I don't know the state of affairs in NLTK proper, or whether it includes any German corpus. You can, however, locate some free corpora, which you'll then need to convert to a format that satisfies the proper NLTK corpus reader, and then you can use this to train a POS tagger for the German language.

You can even create your own corpus, but that is a hell of a painstaking job; if you work in a university, you gotta find ways of bribing and otherwise coercing students to do that for you ;-)
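To make the "train a tagger from a corpus" idea concrete, here is a minimal standard-library sketch of the simplest statistical approach: a unigram tagger that assigns each word its most frequent tag in the training data. The two training sentences are made-up toy data with STTS-style tags, and falling back to `NN` for unknown words is an assumption; NLTK's own `nltk.UnigramTagger` does the same kind of job on a real corpus.

```python
from collections import Counter, defaultdict

# Toy hand-tagged sample (hypothetical data, STTS-style tags); a real tagger
# would be trained on a full tagged corpus.
TRAIN = [
    [("Die", "ART"), ("Katze", "NN"), ("liegt", "VVFIN"), ("auf", "APPR"),
     ("der", "ART"), ("Matte", "NN"), (".", "$.")],
    [("Der", "ART"), ("Hund", "NN"), ("liegt", "VVFIN"), ("auf", "APPR"),
     ("der", "ART"), ("Wiese", "NN"), (".", "$.")],
]

def train_unigram(tagged_sents):
    """Map each word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}

def tag_tokens(model, tokens, default="NN"):
    """Tag tokens with the model; unknown words get the default tag (assumption)."""
    return [(tok, model.get(tok, default)) for tok in tokens]

model = train_unigram(TRAIN)
tagged = tag_tokens(model, "Der Hund liegt auf der Matte .".split())
nouns = [word for word, tag in tagged if tag == "NN"]
print(nouns)
```

A tagger this small only knows the words it has seen, which is why the size and quality of the training corpus matter so much.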

answered Oct 03 '22 14:10

mjv

Possibly you can use the Stanford POS tagger. Below is a recipe I wrote. I have also compiled Python recipes for German NLP, which you can access at http://htmlpreview.github.io/?https://github.com/alvations/DLTK/blob/master/docs/index.html

# -*- coding: utf-8 -*-

import os
import subprocess

def installStanfordTag():
    # Download and unpack the tagger once (needs wget and unzip on the PATH).
    if not os.path.exists('stanford-postagger-full-2013-06-20'):
        os.system('wget http://nlp.stanford.edu/software/stanford-postagger-full-2013-06-20.zip')
        os.system('unzip stanford-postagger-full-2013-06-20.zip')

def tag(infile):
    # Run the tagger script on a file and return its output lines.
    cmd = ['./stanford-postagger.sh', models[m], infile]
    tagout = subprocess.check_output(cmd).decode('utf8').splitlines()
    return [i.strip() for i in tagout]

def taglinebyline(sents):
    tagged = []
    for ss in sents:
        # Write the sentence from Python instead of a shell "echo", so the
        # file is complete (and quoting is safe) before the tagger reads it.
        with open('stanfordtemp.txt', 'w', encoding='utf8') as fout:
            fout.write(ss + '\n')
        tagged.append(tag('stanfordtemp.txt')[0])
    return tagged

installStanfordTag()
stagdir = './stanford-postagger-full-2013-06-20/'
models = {'fast':'models/german-fast.tagger',
          'dewac':'models/german-dewac.tagger',
          'hgc':'models/german-hgc.tagger'}
os.chdir(stagdir)
print(os.getcwd())

m = 'fast'  # It's best to use the fast German tagger if your data is small.

sentences = ['Ich bin schwanger .','Ich bin wieder schwanger .','Ich verstehe nur Bahnhof .']

tagged_sents = taglinebyline(sentences)  # Call the Stanford tagger

for sent in tagged_sents:
    print(sent)
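To get from the tagged lines to the nouns the question asks for, a sketch along the following lines may help. It assumes the tagger's default `word_TAG` output with an underscore separator and STTS tags, where `NN` marks common nouns and `NE` proper nouns. The sample line is written by hand for illustration and is not actual tagger output; `nouns_from_line` is a hypothetical helper.

```python
# Illustrative line in word_TAG form (hand-written, not real tagger output).
sample = "Ich_PPER verstehe_VVFIN nur_ADV Bahnhof_NN ._$."

def nouns_from_line(tagged_line, noun_tags=("NN", "NE")):
    """Return the words tagged as nouns in one word_TAG line (hypothetical helper)."""
    nouns = []
    for token in tagged_line.split():
        # Split on the last underscore so words containing "_" stay intact.
        word, tag = token.rsplit("_", 1)
        if tag in noun_tags:
            nouns.append(word)
    return nouns

print(nouns_from_line(sample))
```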

answered Oct 03 '22 14:10

alvas

I have written a blog post about how to convert the annotated German TIGER Corpus for use with NLTK. Have a look at it here.
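As a rough idea of what such a conversion involves, the sketch below reads a column-based (CoNLL-style) export into lists of (word, tag) pairs, which is the shape NLTK taggers are trained on. The sample rows and the column positions (word in column 1, tag in column 4, counting from 0) are assumptions for illustration; check them against the actual corpus file, and see the blog post for the author's own procedure.

```python
# Hypothetical tab-separated sample: id, word, lemma, unused, POS tag.
# A blank line separates sentences.
SAMPLE = (
    "1\tDie\tder\t_\tART\n"
    "2\tKatze\tKatze\t_\tNN\n"
    "3\tliegt\tliegen\t_\tVVFIN\n"
    "4\t.\t--\t_\t$.\n"
    "\n"
    "1\tSie\tsie\t_\tPPER\n"
    "2\tlacht\tlachen\t_\tVVFIN\n"
    "3\t.\t--\t_\t$.\n"
)

def read_tagged_sentences(text, word_col=1, tag_col=4):
    """Turn column-based text into a list of sentences of (word, tag) pairs."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        cols = line.split("\t")
        current.append((cols[word_col], cols[tag_col]))
    if current:
        sentences.append(current)
    return sentences

sents = read_tagged_sentences(SAMPLE)
print(sents[0])
```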

answered Oct 03 '22 13:10

Philipp
