
The reverse process of stemming

I use a Lucene Snowball analyzer to perform stemming. The results are not meaningful words. I referred to this question.

One of the solutions is to use a database that maps the stemmed version of a word to one stable version of that word. (For example, from communiti to community, no matter which word communiti was originally derived from: communities or some other word.)

I want to know if there is a database which performs such a function.

CTsiddharth asked Feb 28 '12 11:02



2 Answers

It is theoretically impossible to recover a specific word from a stem, since one stem can be common to many words. One possibility, depending on your application, would be to build a database of stems each mapped to an array of several words. But you would then need to predict which one of those words is appropriate given a stem to re-convert.
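To make the stem-to-words database concrete, here is a minimal in-memory sketch. The class name StemIndex and the toyStem function are made up for illustration; toyStem is only a stand-in and is not the Snowball algorithm. In practice you would pass in whatever stemmer you actually index with, so that the keys match your real stems.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.Function;

// Hypothetical helper: maps each stem to every vocabulary word that produced it.
class StemIndex {
    private final Map<String, Set<String>> wordsByStem = new HashMap<>();
    private final Function<String, String> stemmer;

    StemIndex(Function<String, String> stemmer) {
        this.stemmer = stemmer;
    }

    // Stem the word and file it under that stem.
    void add(String word) {
        wordsByStem.computeIfAbsent(stemmer.apply(word), k -> new TreeSet<>()).add(word);
    }

    // All known words sharing this stem; empty if the stem was never seen.
    Set<String> candidates(String stem) {
        return wordsByStem.getOrDefault(stem, Collections.emptySet());
    }

    // Toy stand-in for a real stemmer (an assumption, NOT Snowball).
    static String toyStem(String word) {
        if (word.endsWith("ies")) return word.substring(0, word.length() - 3) + "i";
        if (word.endsWith("y")) return word.substring(0, word.length() - 1) + "i";
        if (word.endsWith("s")) return word.substring(0, word.length() - 1);
        return word;
    }

    public static void main(String[] args) {
        StemIndex index = new StemIndex(StemIndex::toyStem);
        index.add("community");
        index.add("communities");
        // Both words collapse to the same stem, so both come back as candidates.
        System.out.println(index.candidates("communiti"));
    }
}
```

The lookup returns a set rather than a single word, which is exactly the ambiguity described above: something else still has to pick one candidate.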

As a very naive solution to this problem, if you know the word tags, you could try storing words with the tags in your database:

run:
   NN:  runner
   VBG: running
   VBZ: runs

Then, given the stem "run" and the tag "NN", you could determine that "runner" is the most probable word in that context. Of course, that solution is far from perfect. Notably, you'd need to handle the fact that the same word form might be tagged differently in different contexts. But remember that any attempt to solve this problem will be, at best, an approximation.
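A rough sketch of that tagged lookup, assuming you already have a part-of-speech tag for the position you want to fill. The class name TaggedStemLookup is hypothetical, and the table holds only one word per (stem, tag) pair, which is one of the simplifications that makes this naive.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper: stem -> (POS tag -> word form).
class TaggedStemLookup {
    private final Map<String, Map<String, String>> table = new HashMap<>();

    // Register one word form for a stem under a given tag.
    void put(String stem, String tag, String word) {
        table.computeIfAbsent(stem, k -> new HashMap<>()).put(tag, word);
    }

    // Returns the stored word for (stem, tag), or empty if nothing is stored.
    Optional<String> lookup(String stem, String tag) {
        Map<String, String> byTag = table.getOrDefault(stem, Collections.emptyMap());
        return Optional.ofNullable(byTag.get(tag));
    }

    public static void main(String[] args) {
        TaggedStemLookup lookup = new TaggedStemLookup();
        lookup.put("run", "NN", "runner");
        lookup.put("run", "VBG", "running");
        lookup.put("run", "VBZ", "runs");
        System.out.println(lookup.lookup("run", "NN").orElse("(no entry)"));
    }
}
```

Returning an Optional makes the missing case explicit, so the caller has to decide what to do when a stem or tag is not in the table (for example, fall back to the bare stem).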

Edit: from the comments below, it looks like you probably want lemmatization rather than stemming. Here's how to get the lemmas of words using the Stanford CoreNLP tools:

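import java.util.*;

import edu.stanford.nlp.pipeline.*;
import edu.stanford.nlp.ling.*;
import edu.stanford.nlp.ling.CoreAnnotations.*;
import edu.stanford.nlp.util.CoreMap;

public class LemmaExample {
    public static void main(String[] args) {
        // The lemma annotator needs tokens, sentence splits and POS tags first.
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props, false);

        String text = "Hello, world!";
        Annotation document = pipeline.process(text);

        // Walk every token of every sentence and read its surface form and lemma.
        for (CoreMap sentence : document.get(SentencesAnnotation.class)) {
            for (CoreLabel token : sentence.get(TokensAnnotation.class)) {
                String word = token.get(TextAnnotation.class);
                String lemma = token.get(LemmaAnnotation.class);
                System.out.println(word + " -> " + lemma);
            }
        }
    }
}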
user2398029 answered Oct 08 '22 05:10



The question you are referencing contains an important piece of information which is often overlooked. What you require is known as "lemmatisation": the reduction of inflected words to their canonical form. It is related to, but different from, stemming, and it is still an open research question. It is particularly hard for languages with more complex morphology (English is not that hard). Wikipedia has a list of software you can try. Another tool I have used is TreeTagger: it is really fast and reasonably accurate, although its primary purpose is part-of-speech tagging and lemmatisation is just a bonus. Try googling for "statistical lemmatisation" (yes, I do have strong feelings about statistical vs rule-based NLP).

mbatchkarov answered Oct 08 '22 05:10
