
How do I do word Stemming or Lemmatization?

I've tried PorterStemmer and Snowball, but neither works on all words; both miss some very common ones.

My test words are: "cats running ran cactus cactuses cacti community communities", and both get fewer than half right.

See also:

  • Stemming algorithm that produces real words
  • Stemming - code examples or open source projects?
asked Apr 21 '09 10:04 by manixrock




1 Answer

If you know Python, the Natural Language Toolkit (NLTK) has a very powerful lemmatizer that makes use of WordNet.

Note that if you are using this lemmatizer for the first time, you must download the corpus prior to using it. This can be done by:

>>> import nltk
>>> nltk.download('wordnet')

You only have to do this once. Assuming that you have now downloaded the corpus, it works like this:

>>> from nltk.stem.wordnet import WordNetLemmatizer
>>> lmtzr = WordNetLemmatizer()
>>> lmtzr.lemmatize('cars')
'car'
>>> lmtzr.lemmatize('feet')
'foot'
>>> lmtzr.lemmatize('people')
'people'
>>> lmtzr.lemmatize('fantasized','v')
'fantasize'

There are other lemmatizers in the nltk.stem module, but I haven't tried them myself.
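
To illustrate why a dictionary-backed lemmatizer can handle words like "cacti" or "ran" that a pure suffix-stripping stemmer misses, here is a minimal sketch of the general idea: check a table of irregular forms first, then strip suffixes and accept a result only if it is a known word. The word list, exception table, suffix rules, and the name `toy_lemmatize` are all made up for this example; this is not NLTK's actual implementation, and a real lemmatizer relies on far larger tables.

```python
# Toy illustration only: these tables are hypothetical and deliberately tiny.
KNOWN_LEMMAS = {"cat", "run", "cactus", "community"}

# Irregular forms that no suffix rule can recover.
EXCEPTIONS = {"ran": "run", "cacti": "cactus"}

# (suffix, replacement) pairs, tried in order.
SUFFIX_RULES = [("ies", "y"), ("es", ""), ("s", ""), ("ing", "")]


def toy_lemmatize(word):
    """Return a dictionary form of word, or word unchanged if none is found."""
    word = word.lower()
    if word in KNOWN_LEMMAS:
        return word
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            stem = word[: -len(suffix)] + replacement
            candidates = [stem]
            # Undo consonant doubling, e.g. "runn" -> "run".
            if len(stem) >= 2 and stem[-1] == stem[-2]:
                candidates.append(stem[:-1])
            for candidate in candidates:
                # Accept a candidate only if it is a real dictionary word.
                if candidate in KNOWN_LEMMAS:
                    return candidate
    return word


test_words = "cats running ran cactus cactuses cacti community communities".split()
print([toy_lemmatize(w) for w in test_words])
# ['cat', 'run', 'run', 'cactus', 'cactus', 'cactus', 'community', 'community']
```

The dictionary check is what keeps the output to real words: a candidate such as "runn" is rejected rather than returned. The toy version ignores part of speech entirely, whereas the `lemmatize('fantasized','v')` call above passes it explicitly.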

answered Sep 19 '22 13:09 by theycallmemorty
