Diminutive words stemming / lemmatization

I currently use Lucene and Elasticsearch and have run into the following problem: I need to get the stem or lemma of a diminutive word. For instance:

  • doggy -> dog
  • kitty -> cat

etc.

But I get the following results:

  • doggy -> doggi
  • kitty -> kitti

Is there any way to get the root or original word form from a diminutive? It does not have to be a ready-to-use library; any algorithm or approach would help.

Target language: Russian. For example:

  • собачка -> собака
  • кошечка -> кошка

Thanks in advance!

Ivan Kurchenko asked Sep 09 '14 09:09




1 Answer

Firstly, as a side note: what you're trying to do isn't typically called stemming or lemmatization.

Your first issue is mapping the observed token (e.g. собачка) to its normalised form (e.g. собака). Naively, this could be done by creating a SynonymFilter that uses a SynonymMap mapping diminutive forms to their canonical forms. However, you'll likely run into problems with any natural language, because not all derivations are unambiguous. For example, in German, Mädel ('girl'/'lass') could be a diminutive form of Magd (an archaic word meaning 'young woman'/'maid') or of Made ('maggot').
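To make the naive mapping concrete, here is a minimal sketch in plain Python rather than Lucene. It only illustrates the lookup idea behind a synonym-style filter; the table `DIMINUTIVE_TO_CANONICAL` and the function `normalize_tokens` are hypothetical names, and the entries are just the examples from the question, not a real dictionary.

```python
# Hypothetical hand-built table: diminutive form -> canonical form.
# A real table would have to come from a dictionary or a morphological resource.
DIMINUTIVE_TO_CANONICAL = {
    "собачка": "собака",
    "кошечка": "кошка",
    "doggy": "dog",
    "kitty": "cat",
}


def normalize_tokens(tokens):
    """Replace each known diminutive with its canonical form.

    Lookup is case-insensitive; tokens not in the table are returned unchanged.
    """
    return [DIMINUTIVE_TO_CANONICAL.get(token.lower(), token) for token in tokens]


if __name__ == "__main__":
    print(normalize_tokens(["моя", "собачка", "и", "кошечка"]))
```

A one-to-one table like this cannot represent a form such as Mädel, which has more than one possible canonical form, so ambiguous entries need a separate disambiguation step.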

One way of disambiguating between these candidates is to calculate the probability of each canonical form appearing in the given context (e.g. the history of the preceding n tokens) and then replace the diminutive form with the most probable canonical form, using a custom TokenFilter to do so. See e.g. the Wikipedia entry on word-sense disambiguation for different approaches.
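The context-based choice could be sketched roughly as follows, again in plain Python rather than as a Lucene TokenFilter. This is a deliberately tiny model that looks only at the single preceding token (n = 1) and uses add-one smoothing. The names `CANDIDATES`, `BIGRAM_COUNTS`, `score` and `disambiguate` are hypothetical, and the counts are invented toy numbers, not corpus statistics.

```python
from collections import Counter

# Hypothetical table: ambiguous diminutive -> possible canonical forms.
CANDIDATES = {"mädel": ["magd", "made"]}

# Invented toy counts of (preceding token, canonical form) pairs.
# In practice these would be estimated from a corpus.
BIGRAM_COUNTS = Counter({
    ("junge", "magd"): 8,
    ("junge", "made"): 1,
    ("fette", "made"): 6,
    ("fette", "magd"): 1,
})


def score(prev, candidate, num_candidates):
    """Add-one smoothed estimate of P(candidate | preceding token)."""
    total = sum(count for (p, _), count in BIGRAM_COUNTS.items() if p == prev)
    return (BIGRAM_COUNTS[(prev, candidate)] + 1) / (total + num_candidates)


def disambiguate(tokens):
    """Replace each ambiguous diminutive with its most probable canonical form."""
    result = []
    for i, token in enumerate(tokens):
        candidates = CANDIDATES.get(token.lower())
        if not candidates:
            result.append(token)
            continue
        prev = tokens[i - 1].lower() if i > 0 else "<s>"
        # On a tie (e.g. unseen context), max() keeps the first listed candidate.
        best = max(candidates, key=lambda c: score(prev, c, len(candidates)))
        result.append(best)
    return result


if __name__ == "__main__":
    print(disambiguate(["junge", "Mädel"]))
    print(disambiguate(["fette", "Mädel"]))
```

A longer history or a proper word-sense disambiguation model would replace `score`; the surrounding replace-the-token loop would stay much the same.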

errantlinguist answered Oct 06 '22 14:10
