I currently use Lucene and Elasticsearch and have run into the following problem: I need to get the stemmed form or lemma of a diminutive word. For instance: <ul> <li>doggy -> dog</li> <li>kitty -> cat</li> </ul> Instead, I get these results: <ul> <li>doggy -> doggi</li> <li>kitty -> kitti</li> </ul> Is there any way to get the root or original form of a diminutive word? It does not have to be a ready-to-use library; any algorithm or approach would help. The target language is Russian. For example: <ul> <li>собачка -> собака</li> <li>кошечка -> кошка</li> </ul> Thanks in advance!

Diminutive words stemming / lemmatization

1 Answer

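Firstly, as a side note: what you're trying to do isn't typically called stemming or lemmatization, since a diminutive is usually a derived word in its own right rather than an inflected form of its base.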

Your first issue is mapping the observed token (e.g. собачка) to its normalised form (e.g. собака). Naively, this could be done by creating a <code>SynonymFilter</code> that uses a <code>SynonymMap</code> mapping diminutive forms to their canonical forms. However, you'll likely run into problems with any natural language, because not all derivations are unambiguous. For example, in German, Mädel ('girl'/'lass') could be a diminutive form of Magd (an archaic word meaning 'young woman'/'maid') or of Made ('maggot').
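To make the naive mapping idea concrete, here is a minimal sketch in plain Java, deliberately independent of Lucene. The class name `DiminutiveNormalizer` and its methods are hypothetical, and the two dictionary entries are just the examples from the question; a real table would have to come from a morphological dictionary or be curated by hand.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Lucene: a plain lookup table standing in
// for the diminutive -> canonical mapping that a SynonymMap would hold.
final class DiminutiveNormalizer {

    // Illustrative entries only, taken from the examples in the question.
    private static final Map<String, String> CANONICAL = Map.of(
            "собачка", "собака",
            "кошечка", "кошка");

    // Returns the canonical form if the token is a known diminutive,
    // otherwise the lower-cased token unchanged.
    static String normalize(String token) {
        String key = token.toLowerCase(Locale.ROOT);
        return CANONICAL.getOrDefault(key, key);
    }

    // Applies normalize() to every token, preserving order.
    static List<String> normalizeAll(List<String> tokens) {
        List<String> out = new ArrayList<>(tokens.size());
        for (String token : tokens) {
            out.add(normalize(token));
        }
        return out;
    }
}
```

In a Lucene analysis chain the same table would presumably be loaded into a `SynonymMap` and applied by a synonym filter instead of being looked up by hand. Note that a one-to-one table like this cannot represent the ambiguous case, where a single diminutive has several candidate base forms.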

One way of disambiguating these two forms is to calculate the probability of each canonical form appearing in the given context (e.g. the history of the preceding n tokens) and then replace the diminutive form with the most probable canonical form, using a custom-made <code>TokenFilter</code> to do so. See e.g. the Wikipedia entry on word-sense disambiguation for different approaches.
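The context-based choice could look roughly like the sketch below, again in plain Java rather than as an actual `TokenFilter`. Everything here is an assumption made for illustration: the class `ContextDisambiguator`, its methods, and the scoring scheme (an add-one smoothed log-likelihood of the preceding tokens under each candidate, with no prior). It is one simple stand-in for "the most probable canonical form", not the only or best way to estimate it.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch, not a Lucene API: chooses the canonical form whose
// context counts best match the tokens preceding the diminutive.
final class ContextDisambiguator {

    // candidate form -> (context word -> co-occurrence count)
    private final Map<String, Map<String, Integer>> contextCounts;
    private final int vocabularySize;

    ContextDisambiguator(Map<String, Map<String, Integer>> contextCounts) {
        this.contextCounts = contextCounts;
        Set<String> vocabulary = new HashSet<>();
        for (Map<String, Integer> counts : contextCounts.values()) {
            vocabulary.addAll(counts.keySet());
        }
        // +1 reserves probability mass for context words absent from the counts.
        this.vocabularySize = vocabulary.size() + 1;
    }

    // Add-one smoothed log-likelihood of the preceding tokens for a candidate.
    double logScore(String candidate, List<String> precedingTokens) {
        Map<String, Integer> counts = contextCounts.getOrDefault(candidate, Map.of());
        int total = 0;
        for (int count : counts.values()) {
            total += count;
        }
        double score = 0.0;
        for (String token : precedingTokens) {
            int count = counts.getOrDefault(token, 0);
            score += Math.log((count + 1.0) / (total + vocabularySize));
        }
        return score;
    }

    // Returns the best-scoring candidate; ties go to the earlier candidate,
    // and an empty candidate list yields null.
    String mostProbable(List<String> candidates, List<String> precedingTokens) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String candidate : candidates) {
            double score = logScore(candidate, precedingTokens);
            if (score > bestScore) {
                best = candidate;
                bestScore = score;
            }
        }
        return best;
    }
}
```

With made-up toy counts such as `Magd -> {junge: 5, dienst: 3}` and `Made -> {fette: 4, apfel: 2}` (invented for this example, not corpus data), the preceding tokens `["die", "junge"]` would select Magd and `["fette"]` would select Made. Real counts would have to be estimated from a corpus in which the canonical forms occur.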

105

answered Oct 06 '22 14:10

errantlinguist


Tags:

java

lucene

elasticsearch

nlp

morphological-analysis

Ivan Kurchenko
