I'm looking for a library which can perform a morphological analysis on German words, i.e. it converts any word into its root form and providing meta information about the analysed word.
For example:
gegessen -> essen
wurde [...] gefasst -> fassen
Häuser -> Haus
Hunde -> Hund
My wishlist:
EDIT: I'm aware that there is no way to perform a morphological analysis without any dictionary at all, because of the irregular words. When I say, I prefer a library without a dictionary I mean those full blown dictionaries which map each and every word:
arbeite -> arbeiten
arbeitest -> arbeiten
arbeitet -> arbeiten
arbeitete -> arbeiten
arbeitetest -> arbeiten
arbeiteten -> arbeiten
arbeitetet -> arbeiten
gearbeitet -> arbeiten
arbeite -> arbeiten
...
Those dictionaries have several drawbacks, including the huge size and the inability to process unknown words.
Of course all exceptions can only be handled with a dictionary:
esse -> essen
isst -> essen
eßt -> essen
aß -> essen
aßt -> essen
aßen -> essen
...
(My mind is spinning right now :) )
I think you are looking for a "stemming algorithm".
Martin Porter's approach is well known among linguists. The Porter stemmer is basically an affix stripping algorithm, combined with a few substitution rules for those special cases.
Most stemmers deliver stems that are linguistically "incorrect". For example: both "beautiful" and "beauty" can result in the stem "beauti", which, of course, is not a real word. This doesn't matter, though, if you're using those stems to improve search results in information retrieval systems. Lucene comes with support for the Porter stemmer, for instance.
Porter also devised a simple programming language for developing stemmers, called Snowball.
There are also stemmers for German available in Snowball. A C version, generated from the Snowball source, is also available on the website, along with a plain text explanation of the algorithm.
Here's the German stemmer in Snowball: http://snowball.tartarus.org/algorithms/german/stemmer.html
If you're looking for the corresponding stem of a word as you would find it in a dictionary, along with information on the part of speech, you should Google for "lemmatization".
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With