 

German Stemming for Sentiment Analysis in Python NLTK

I've recently begun working on a sentiment analysis project on German texts and I'm planning on using a stemmer to improve the results.

NLTK comes with a German Snowball Stemmer and I've already tried to use it, but I'm unsure about the results. Maybe it should be this way, but as a computer scientist and not a linguist, I have a problem with inflected verb forms stemmed to a different stem.

Take the word "suchen" (to search), which is stemmed to "such" for 1st person singular but to "sucht" for 3rd person singular.
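This behaviour is easy to reproduce with NLTK's German Snowball stemmer (the outputs below match the ones described above):

```python
from nltk.stem.snowball import GermanStemmer

stemmer = GermanStemmer()

# The infinitive loses its "-en" suffix, but the 3rd person
# singular form keeps its "-t", so the two stems differ.
print(stemmer.stem("suchen"))  # such
print(stemmer.stem("sucht"))   # sucht
```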

I know there is also lemmatization, but no working German lemmatizer is integrated into NLTK as far as I know. There is GermaNet, but their NLTK integration seems to have been aborted.

Getting to the point: I would like inflected verb forms to be stemmed to the same stem, at the very least for regular verbs within the same tense. If this is not a useful requirement for my goal, please tell me why. If it is, do you know of any additional resources to use which can help me achieve this goal?

Edit: I forgot to mention, any software should be free to use for educational and research purposes.

Florian asked Jun 13 '17


2 Answers

As a computer scientist, you are definitely looking in the right direction to tackle this linguistic issue ;). Stemming is usually quite a bit simpler and is used in Information Retrieval tasks to shrink the lexicon, but it is usually not sufficient for more sophisticated linguistic analysis. Lemmatisation partly overlaps with the use case for stemming, but it rewrites, for example, all verb inflections to the same root form (the lemma), and it can also distinguish "work" as a noun from "work" as a verb (although this depends a bit on the implementation and quality of the lemmatiser). To do this, it usually needs a bit more information (such as POS tags or syntax trees), so it takes considerably longer, which makes it less suitable for IR tasks, which typically deal with larger amounts of data.
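The distinction can be sketched with a toy example (the suffix rules and lemma table below are made up for illustration, not a real stemmer or lemmatiser): a stemmer clips suffixes mechanically and can therefore produce different stems for inflections of one verb, while a lemmatiser looks inflections up and maps them all to one lemma.

```python
def toy_stem(word):
    # Crude suffix stripping, roughly what a stemmer does.
    for suffix in ("en", "e"):
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word

# Hypothetical lemma lookup table; a real lemmatiser would use a
# large lexicon plus context (POS tags) instead.
LEMMA_TABLE = {"suchen": "suchen", "sucht": "suchen", "suche": "suchen"}

def toy_lemmatize(word):
    return LEMMA_TABLE.get(word, word)

for w in ("suchen", "sucht"):
    print(w, "->", toy_stem(w), "|", toy_lemmatize(w))
```

The stemmer yields "such" and "sucht" (two different stems), while the lookup maps both forms to the single lemma "suchen".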

In addition to GermaNet (I didn't know its NLTK integration had been aborted, and I never really tried it myself, because although it is free you have to sign an agreement to get access), there is spaCy, which you could have a look at: https://spacy.io/docs/usage/

Very easy to install and use. See the install instructions on the website, then download the German model using:

python -m spacy download de

then:

>>> import spacy
>>> nlp = spacy.load('de')
>>> doc = nlp('Wir suchen ein Beispiel')
>>> for token in doc:
...     print(token, token.lemma, token.lemma_)
... 
Wir 521 wir
suchen 1162 suchen
ein 486 ein
Beispiel 809 Beispiel
>>> doc = nlp('Er sucht ein Beispiel')
>>> for token in doc:
...     print(token, token.lemma, token.lemma_)
... 
Er 513 er
sucht 1901 sucht
ein 486 ein
Beispiel 809 Beispiel

As you can see, unfortunately it doesn't do a very good job on your specific example ("sucht" is not lemmatised to "suchen"). I'm also not sure what the number represents (presumably a lemma id, but I don't know what other information can be obtained from it), but maybe you can give it a go and see if it helps you.

Igor answered Oct 19 '22


A good and easy solution is to use the TreeTagger. First you have to install the TreeTagger manually (which basically means unzipping the right zip file somewhere on your computer). You will find the binary distribution here: http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/

Then you need to install a wrapper to call it from Python.

The following code (after installing the wrapper with pip install treetaggerwrapper) lemmatizes a tokenized sentence:

import pprint
import treetaggerwrapper

tagger = treetaggerwrapper.TreeTagger(TAGLANG='de')

tokenized_sent = ['Er', 'sucht', 'ein', 'Beispiel']
tags = tagger.tag_text(tokenized_sent, tagonly=True)  # don't use the TreeTagger's tokenization!

pprint.pprint(tags)

You can also use a method from treetaggerwrapper to make nice objects out of the TreeTagger's output:

tags2 = treetaggerwrapper.make_tags(tags)
pprint.pprint(tags2)
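Working with that output might then look like the sketch below. make_tags yields Tag namedtuples with word, pos, and lemma fields; the namedtuple is recreated here with made-up sample values so the snippet runs without a TreeTagger installation, and the POS tags and lemmas shown are illustrative, not actual TreeTagger output:

```python
from collections import namedtuple

# Mimic the Tag objects produced by treetaggerwrapper.make_tags
Tag = namedtuple("Tag", ["word", "pos", "lemma"])

tags2 = [
    Tag("Er", "PPER", "er"),
    Tag("sucht", "VVFIN", "suchen"),   # inflected form mapped to its lemma
    Tag("ein", "ART", "ein"),
    Tag("Beispiel", "NN", "Beispiel"),
]

# Pull out just the lemmas for downstream sentiment analysis
lemmas = [t.lemma for t in tags2]
print(lemmas)
```

The point for the question above: with a lemmatiser like this, "sucht" and "suchen" both end up as "suchen".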

That is all.

Christian Wartena answered Oct 19 '22