Currently I use Lucene and Elasticsearch, and have the following problem: I need to get the stemmed form or lemma of a diminutive word. The target language is Russian, so, for example, a diminutive such as собачка should be reduced to its base form собака, and so on. But the results I actually get are not the base forms. Is there any way (a ready-to-use library, an algorithm, an approach, anything) to get the root / original word form for diminutive word forms? Thanks in advance!
Stemming is a process that removes the last few characters from a word, often producing a form with an incorrect meaning or spelling. Lemmatization considers the context and converts the word to its meaningful base form, which is called the lemma. For instance, a crude stemmer may reduce the word 'Caring' to 'Car', while lemmatization would yield 'Care'.
Short answer: go with stemming when the vocabulary is small and the documents are large; conversely, go with word embeddings when the vocabulary is large but the documents are small. Lemmatization is usually not worth it, since the ratio of performance gained to cost added is quite low.
Lemmatization, by contrast, provides better results by performing an analysis that depends on the word's part of speech and by producing real dictionary words. As a result, lemmatization is harder to implement and slower than stemming.
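To make the stemming side concrete in the question's Lucene/Russian setting, here is a minimal sketch (assuming a recent Lucene release with the analyzers-common / analysis-common module on the classpath; the class name RussianStemDemo is just illustrative) that prints the stem produced for a base noun and for its diminutive:

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.ru.RussianAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class RussianStemDemo {
    public static void main(String[] args) throws Exception {
        // RussianAnalyzer tokenizes, lowercases, removes stopwords and applies the Snowball Russian stemmer.
        try (Analyzer analyzer = new RussianAnalyzer()) {
            for (String word : new String[] {"собака", "собачка"}) {
                try (TokenStream ts = analyzer.tokenStream("body", word)) {
                    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
                    ts.reset();
                    while (ts.incrementToken()) {
                        // The stems typically differ (roughly "собак" vs. "собачк"),
                        // so the base noun and its diminutive end up as different index terms.
                        System.out.println(word + " -> " + term.toString());
                    }
                    ts.end();
                }
            }
        }
    }
}
```

Since the two stems generally differ, a stemmer alone will not conflate the diminutive with its base word, which is essentially the problem described in the question.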
The difference between stemming and lemmatization shows up clearly when the two are run on the same input: the PorterStemmer class simply chops the 'es' off the word, whereas the WordNetLemmatizer class finds a valid dictionary word. In simple terms, the stemming technique only looks at the form of the word, whereas the lemmatization technique looks at the meaning of the word.
In simpler terms, lemmatization is the method that maps any form of a word back to its base root. In other words, lemmatization is responsible for grouping the different inflected forms of a word into a single root form that carries the same meaning.
This is why regular dictionaries are lists of lemmas, not stems. One consequence is that the same stem can be shared by inflectional forms of different lemmas, which translates into noise in our search results; in fact, it is very common for a single surface form to be an instance of several lemmas.
Stemming is a somewhat crude method for cataloging related words; it essentially chops off letters from the end until the stem is reached. This works fairly well in most cases, but unfortunately English has many exceptions where a more sophisticated process is required.
First, as a side note: what you're trying to do isn't typically called stemming or lemmatization.
Your first issue is mapping the observed token (e.g. собачка) to its normalised form (e.g. собака). Naively, this could be done by creating a SynonymFilter which uses a SynonymMap that maps diminutive forms to their canonical forms, as sketched below. However, you'll likely run into problems with any natural language, because not all derivations are unambiguous: for example, in German, Mädel ('girl'/'lass') could be a diminutive form of Magd (an archaic word meaning 'young woman'/'maid') or of Made ('maggot').
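A minimal sketch of that naive mapping, assuming a recent Lucene version (SynonymGraphFilter is the current replacement for the older SynonymFilter; the class name and the two word pairs are purely illustrative entries, not a real diminutive dictionary):

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.synonym.SynonymGraphFilter;
import org.apache.lucene.analysis.synonym.SynonymMap;
import org.apache.lucene.util.CharsRef;

public class DiminutiveNormalizer {

    // Build a SynonymMap that rewrites each diminutive to its canonical form.
    static SynonymMap buildDiminutiveMap() throws Exception {
        SynonymMap.Builder builder = new SynonymMap.Builder(true); // dedup duplicate rules
        builder.add(new CharsRef("собачка"), new CharsRef("собака"), false); // keepOrig = false
        builder.add(new CharsRef("домик"), new CharsRef("дом"), false);
        return builder.build();
    }

    // An analyzer whose token stream lowercases and then applies the synonym mapping.
    static Analyzer diminutiveAnalyzer(SynonymMap map) {
        return new Analyzer() {
            @Override
            protected TokenStreamComponents createComponents(String fieldName) {
                Tokenizer source = new StandardTokenizer();
                TokenStream stream = new LowerCaseFilter(source);
                stream = new SynonymGraphFilter(stream, map, true); // ignoreCase = true
                return new TokenStreamComponents(source, stream);
            }
        };
    }
}
```

In Elasticsearch the same idea is usually expressed declaratively, as a synonym token filter in the index analysis settings, rather than in Java code.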
One way of disambiguating such forms would be to calculate the probability of each canonical form appearing in the given context (e.g. the history of the preceding n tokens) and then to replace the diminutive form with the most probable canonical form, using a custom-made TokenFilter to do so; a rough skeleton of such a filter is sketched below. See e.g. the Wikipedia entry on word-sense disambiguation for different approaches.
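The skeleton below is only an outline of that idea, not a production filter: the ContextModel interface, the candidate map and the class name are hypothetical placeholders for whatever dictionary and language model or word-sense-disambiguation component you would actually plug in.

```java
import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;
import java.util.Map;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public final class DisambiguatingDiminutiveFilter extends TokenFilter {

    /** Hypothetical scoring hook: likelihood of a candidate lemma given the preceding tokens. */
    public interface ContextModel {
        double score(String candidate, Deque<String> history);
    }

    private final Map<String, List<String>> candidates; // diminutive -> possible canonical forms
    private final ContextModel model;
    private final Deque<String> history = new ArrayDeque<>();
    private final int historySize;
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

    public DisambiguatingDiminutiveFilter(TokenStream input, Map<String, List<String>> candidates,
                                          ContextModel model, int historySize) {
        super(input);
        this.candidates = candidates;
        this.model = model;
        this.historySize = historySize;
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) {
            return false;
        }
        String token = termAtt.toString();
        List<String> forms = candidates.get(token);
        if (forms != null && !forms.isEmpty()) {
            // Replace the diminutive with the canonical form the context model scores highest.
            String best = forms.get(0);
            double bestScore = Double.NEGATIVE_INFINITY;
            for (String form : forms) {
                double s = model.score(form, history);
                if (s > bestScore) {
                    bestScore = s;
                    best = form;
                }
            }
            termAtt.setEmpty().append(best);
            token = best;
        }
        // Keep a sliding window of the last n emitted tokens as context.
        history.addLast(token);
        if (history.size() > historySize) {
            history.removeFirst();
        }
        return true;
    }

    @Override
    public void reset() throws IOException {
        super.reset();
        history.clear();
    }
}
```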