
WordNetLemmatizer not returning the right lemma unless POS is explicit - Python NLTK

I'm lemmatizing the TED dataset transcripts and noticed something strange: not all words are being lemmatized. For example,

selected -> select

Which is right.

However, involved is not reduced to involve, nor horsing to horse, unless I explicitly pass 'v' (verb) as the POS argument.

In the Python terminal I get the right output, but not in my code:

>>> from nltk.stem import WordNetLemmatizer
>>> from nltk.corpus import wordnet
>>> lem = WordNetLemmatizer()
>>> lem.lemmatize('involved','v')
u'involve'
>>> lem.lemmatize('horsing','v')
u'horse'

The relevant section of the code is this:

for l in LDA_Row[0].split('+'):
    w=str(l.split('*')[1])
    word=lmtzr.lemmatize(w)
    wordv=lmtzr.lemmatize(w,'v')
    print wordv, word
    # if word != wordv:
    #   print word, wordv

The whole code is here.

What is the problem?

FlyingAura asked Oct 05 '15 21:10



1 Answer

The lemmatizer needs the correct POS tag to be accurate. If you call WordNetLemmatizer.lemmatize() without a POS argument, the tag defaults to noun; see https://github.com/nltk/nltk/blob/develop/nltk/stem/wordnet.py#L39

To resolve the problem, always POS-tag your data before lemmatizing, e.g.

>>> from nltk.stem import WordNetLemmatizer
>>> from nltk import pos_tag, word_tokenize
>>> wnl = WordNetLemmatizer()
>>> sent = 'This is a foo bar sentence'
>>> pos_tag(word_tokenize(sent))
[('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('foo', 'NN'), ('bar', 'NN'), ('sentence', 'NN')]
>>> for word, tag in pos_tag(word_tokenize(sent)):
...     # Penn Treebank adjective tags start with 'J'; WordNet uses 'a'.
...     wntag = 'a' if tag.startswith('J') else tag[0].lower()
...     wntag = wntag if wntag in ['a', 'r', 'n', 'v'] else None
...     if not wntag:
...             lemma = word
...     else:
...             lemma = wnl.lemmatize(word, wntag)
...     print(lemma)
... 
This
be
a
foo
bar
sentence

Note that 'is' is reduced to 'be' only when it is tagged as a verb, i.e.

>>> wnl.lemmatize('is')
'is'
>>> wnl.lemmatize('is', 'v')
u'be'
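
When a word arrives without any sentence context (as with the topic terms in the question), it can help to see what the lemmatizer returns under each WordNet POS. The following is only a diagnostic sketch, not part of NLTK: the helper names lemmas_by_pos and changed_lemmas are made up for illustration, and the lemmatize argument is assumed to be any callable with the same (word, pos) signature as WordNetLemmatizer.lemmatize.

```python
# The four POS codes that WordNetLemmatizer.lemmatize accepts.
WORDNET_POS = ("n", "v", "a", "r")  # noun, verb, adjective, adverb


def lemmas_by_pos(word, lemmatize):
    """Map each WordNet POS to the lemma that lemmatize(word, pos) returns."""
    return {pos: lemmatize(word, pos) for pos in WORDNET_POS}


def changed_lemmas(word, lemmatize):
    """Keep only the POS readings under which the word was actually reduced."""
    return {
        pos: lemma
        for pos, lemma in lemmas_by_pos(word, lemmatize).items()
        if lemma != word
    }
```

Passing wnl.lemmatize as the callable, changed_lemmas('is', wnl.lemmatize) should contain the 'v' reading shown in the transcript above and no 'n' reading, since the noun lookup leaves 'is' untouched.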

To answer the question with words from your examples:

>>> sent = 'These sentences involves some horsing around'
>>> for word, tag in pos_tag(word_tokenize(sent)):
...     # Penn Treebank adjective tags start with 'J'; WordNet uses 'a'.
...     wntag = 'a' if tag.startswith('J') else tag[0].lower()
...     wntag = wntag if wntag in ['a', 'r', 'n', 'v'] else None
...     lemma = wnl.lemmatize(word, wntag) if wntag else word
...     print(lemma)
... 
These
sentence
involve
some
horse
around
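
The tag-then-lemmatize loop can be wrapped into reusable helpers, roughly as sketched here. This is one possible arrangement rather than an NLTK API: the names PENN_TO_WORDNET, penn_to_wordnet, lemmatize_tagged and lemmatize_sentence are hypothetical. The mapping is written out explicitly because Penn Treebank adjective tags begin with 'J' while WordNet expects 'a'. The lemmatizer is passed in as a callable so the mapping logic can be checked without the WordNet data.

```python
# Penn Treebank tag prefix -> WordNet POS code ('n', 'v', 'a', 'r').
PENN_TO_WORDNET = {"N": "n", "V": "v", "J": "a", "R": "r"}


def penn_to_wordnet(tag):
    """Return the WordNet POS for a Penn Treebank tag, or None if unmapped."""
    return PENN_TO_WORDNET.get(tag[:1].upper())


def lemmatize_tagged(tagged_tokens, lemmatize):
    """Lemmatize (word, penn_tag) pairs using a lemmatize(word, pos) callable.

    Words whose tag has no WordNet equivalent (determiners, prepositions
    and so on) are passed through unchanged.
    """
    lemmas = []
    for word, tag in tagged_tokens:
        pos = penn_to_wordnet(tag)
        lemmas.append(lemmatize(word, pos) if pos else word)
    return lemmas


def lemmatize_sentence(sentence):
    """Tokenize, POS-tag and lemmatize one sentence with NLTK.

    NLTK is imported lazily so the helpers above work without it.
    Assumes the NLTK tokenizer, tagger and WordNet data are downloaded.
    """
    from nltk import pos_tag, word_tokenize
    from nltk.stem import WordNetLemmatizer

    wnl = WordNetLemmatizer()
    return lemmatize_tagged(pos_tag(word_tokenize(sentence)), wnl.lemmatize)
```

Calling lemmatize_sentence('These sentences involves some horsing around') runs the same steps as the transcript above; the result still depends on the tags the POS tagger assigns.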

Note that there are some quirks with WordNetLemmatizer:

  • wordnet lemmatization and pos tagging in python
  • Python NLTK Lemmatization of the word 'further' with wordnet

Also, NLTK's default POS tagger is undergoing some major changes to improve accuracy:

  • Python NLTK pos_tag not returning the correct part-of-speech tag
  • https://github.com/nltk/nltk/issues/1110
  • https://github.com/nltk/nltk/pull/1143

For an off-the-shelf lemmatizer, take a look at https://github.com/alvations/pywsd, where I've added some try-excepts to catch words that are not in WordNet; see https://github.com/alvations/pywsd/blob/master/pywsd/utils.py#L66

alvas answered Nov 15 '22 21:11