
Getting the root word using the Wordnet Lemmatizer

For a keyword extractor, I need to map all related words to one common root word.

How can I convert words to the same root using the Python NLTK lemmatizer?

  • E.g.:
    1. generalized, generalization -> general
    2. optimal, optimized -> optimize (maybe)
    3. configure, configuration, configured -> configure

The Python NLTK lemmatizer gives 'generalize' for 'generalized' and 'generalizing' when the part-of-speech (pos) tag parameter is used, but it does not do so for 'generalization'.

Is there a way to do this?

Shanika Ediriweera asked Sep 03 '16 03:09


1 Answer

Use SnowballStemmer:

>>> from nltk.stem.snowball import SnowballStemmer
>>> stemmer = SnowballStemmer("english")
>>> print(stemmer.stem("generalized"))
general
>>> print(stemmer.stem("generalization"))
general
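
For the keyword-extractor use case in the question, the same stemmer can be used to bucket related words under one key. The following is only a minimal sketch: `group_by_root` and `snowball_stem` are hypothetical helper names, not part of NLTK, and the stem is used purely as a grouping key since a stem is not guaranteed to be a dictionary word.

```python
from collections import defaultdict


def group_by_root(words, stem):
    """Group words by the root that `stem` returns for each of them.

    `stem` is any callable mapping a word to its root, for example
    SnowballStemmer("english").stem.
    """
    groups = defaultdict(list)
    for word in words:
        # Lowercase first so "Generalized" and "generalized" share a key.
        groups[stem(word.lower())].append(word)
    return dict(groups)


def snowball_stem():
    """Return the English Snowball stemming function (hypothetical helper)."""
    # Imported lazily so group_by_root itself has no hard NLTK dependency.
    from nltk.stem.snowball import SnowballStemmer
    return SnowballStemmer("english").stem
```

Usage would look like `group_by_root(["generalized", "generalization"], snowball_stem())`, which puts both words under the single stem shown above.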

Note: Lemmatisation is closely related to stemming. The difference is that a stemmer operates on a single word without knowledge of the context, and therefore cannot discriminate between words which have different meanings depending on part of speech.

A general issue I have seen with lemmatizers is that they treat longer, derived words as lemmas in their own right instead of reducing them further.

Example, using the WordNet lemmatizer (checked in NLTK):

  • Generalized => Generalize
  • Generalization => Generalization
  • Generalizations => Generalization

No POS tag was passed in the above cases, so each word was treated as a noun (the default).
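
If you want to stay with the lemmatizer and do not know the POS tag in advance, one possible workaround is to try each WordNet tag and keep the shortest result. This is a heuristic sketch, not an NLTK feature: `shortest_lemma` and `wordnet_lemmatize` are hypothetical names, and running the real lemmatizer assumes the WordNet corpus has been fetched with `nltk.download("wordnet")`.

```python
def shortest_lemma(word, lemmatize, pos_tags=("n", "v", "a", "r")):
    """Lemmatize `word` under every POS tag and return the shortest lemma.

    `lemmatize` is any callable taking (word, pos), for example
    WordNetLemmatizer().lemmatize. On ties the earliest tag wins.
    """
    candidates = [lemmatize(word, pos) for pos in pos_tags]
    return min(candidates, key=len)


def wordnet_lemmatize():
    """Return NLTK's WordNet lemmatize method (hypothetical helper)."""
    # Lazy import: requires NLTK and the downloaded WordNet corpus.
    from nltk.stem import WordNetLemmatizer
    return WordNetLemmatizer().lemmatize
```

Usage would be `shortest_lemma("generalized", wordnet_lemmatize())`. As the question notes, the verb tag is what yields 'generalize'. This does not reduce 'generalization' to 'general'; for that kind of cross-category reduction a stemmer is still needed.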

Ani Menon answered Sep 18 '22 02:09
