 

German Stemming for Sentiment Analysis in Python NLTK

I've recently begun working on a sentiment analysis project on German texts and I'm planning on using a stemmer to improve the results.

NLTK comes with a German Snowball Stemmer and I've already tried to use it, but I'm unsure about the results. Maybe it should be this way, but as a computer scientist and not a linguist, I have a problem with inflected verb forms stemmed to a different stem.

Take the word "suchen" (to search), which is stemmed to "such" for 1st person singular but to "sucht" for 3rd person singular.
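This behaviour is easy to reproduce with NLTK's German Snowball stemmer (the outputs below match the ones described above):

```python
from nltk.stem.snowball import GermanStemmer

stemmer = GermanStemmer()

# The infinitive loses its "-en" suffix, but the 3rd person
# singular form keeps its "-t", so the two stems differ.
print(stemmer.stem("suchen"))  # such
print(stemmer.stem("sucht"))   # sucht
```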

I know there is also lemmatization, but no working German lemmatizer is integrated into NLTK as far as I know. There is GermaNet, but their NLTK integration seems to have been aborted.

Getting to the point: I would like inflected verb forms to be stemmed to the same stem, at the very least for regular verbs within the same tense. If this is not a useful requirement for my goal, please tell me why. If it is, do you know of any additional resources to use which can help me achieve this goal?

Edit: I forgot to mention, any software should be free to use for educational and research purposes.

Florian asked Jun 13 '17


2 Answers

As a computer scientist, you are definitely looking in the right direction to tackle this linguistic issue ;). Stemming is usually quite a bit simpler and is used in Information Retrieval tasks to shrink the lexicon, but it is usually not sufficient for more sophisticated linguistic analysis. Lemmatisation partly overlaps with the use case for stemming, but it rewrites, for example, all verb inflections to the same root form (the lemma), and it can also distinguish "work" as a noun from "work" as a verb (although this depends a bit on the implementation and quality of the lemmatiser). To do this, it usually needs a bit more information (such as POS tags or syntax trees), so it takes considerably longer, which makes it less suitable for IR tasks, which typically deal with larger amounts of data.
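The distinction can be sketched with a toy example (the suffix rules and lemma table below are made up for illustration, not a real stemmer or lemmatiser): a stemmer clips suffixes mechanically and can therefore produce different stems for inflections of one verb, while a lemmatiser looks inflections up and maps them all to one lemma.

```python
def toy_stem(word):
    # Crude suffix stripping, roughly what a stemmer does.
    for suffix in ("en", "e"):
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word

# Hypothetical lemma lookup table; a real lemmatiser would use a
# large lexicon plus context (POS tags) instead.
LEMMA_TABLE = {"suchen": "suchen", "sucht": "suchen", "suche": "suchen"}

def toy_lemmatize(word):
    return LEMMA_TABLE.get(word, word)

for w in ("suchen", "sucht"):
    print(w, "->", toy_stem(w), "|", toy_lemmatize(w))
```

The stemmer yields "such" and "sucht" (two different stems), while the lookup maps both forms to the single lemma "suchen".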

In addition to GermaNet (I didn't know its NLTK integration had been aborted, and I never really tried it myself, because although it is free you have to sign an agreement to get access), there is spaCy, which you could have a look at: https://spacy.io/docs/usage/

Very easy to install and use. See the install instructions on the website, then download the German model using:

python -m spacy download de

then:

>>> import spacy
>>> nlp = spacy.load('de')
>>> doc = nlp('Wir suchen ein Beispiel')
>>> for token in doc:
...     print(token, token.lemma, token.lemma_)
... 
Wir 521 wir
suchen 1162 suchen
ein 486 ein
Beispiel 809 Beispiel
>>> doc = nlp('Er sucht ein Beispiel')
>>> for token in doc:
...     print(token, token.lemma, token.lemma_)
... 
Er 513 er
sucht 1901 sucht
ein 486 ein
Beispiel 809 Beispiel

As you can see, unfortunately it doesn't do a very good job on your specific example ("sucht" is not lemmatised to "suchen"). I'm also not sure what the number represents (presumably a lemma id, but I don't know what other information can be obtained from it), but maybe you can give it a go and see if it helps you.

Igor answered Oct 19 '22


A good and easy solution is to use the TreeTagger. First you have to install the TreeTagger manually (which basically means unzipping the right zip file somewhere on your computer). You will find the binary distribution here: http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/

Then you need to install a wrapper to call it from Python.

The following code (after installing the wrapper with pip install treetaggerwrapper) lemmatizes a tokenized sentence:

import pprint
import treetaggerwrapper

tagger = treetaggerwrapper.TreeTagger(TAGLANG='de')

tokenized_sent = ['Er', 'sucht', 'ein', 'Beispiel']
tags = tagger.tag_text(tokenized_sent, tagonly=True)  # don't use the TreeTagger's tokenization!

pprint.pprint(tags)

You can also use a method from treetaggerwrapper to make nice objects out of the TreeTagger's output:

tags2 = treetaggerwrapper.make_tags(tags)
pprint.pprint(tags2)
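Working with that output might then look like the sketch below. make_tags yields Tag namedtuples with word, pos, and lemma fields; the namedtuple is recreated here with made-up sample values so the snippet runs without a TreeTagger installation, and the POS tags and lemmas shown are illustrative, not actual TreeTagger output:

```python
from collections import namedtuple

# Mimic the Tag objects produced by treetaggerwrapper.make_tags
Tag = namedtuple("Tag", ["word", "pos", "lemma"])

tags2 = [
    Tag("Er", "PPER", "er"),
    Tag("sucht", "VVFIN", "suchen"),   # inflected form mapped to its lemma
    Tag("ein", "ART", "ein"),
    Tag("Beispiel", "NN", "Beispiel"),
]

# Pull out just the lemmas for downstream sentiment analysis
lemmas = [t.lemma for t in tags2]
print(lemmas)
```

The point for the question above: with a lemmatiser like this, "sucht" and "suchen" both end up as "suchen".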

That is all.

Christian Wartena answered Oct 19 '22