I'm lemmatizing the TED Transcript dataset. I've noticed something strange: not all words are being lemmatized. For example,
selected -> select
which is right. However,
involved !-> involve
and horsing !-> horse
unless I explicitly pass the 'v' (verb) argument.
In the Python terminal I get the right output, but not in my code:
>>> from nltk.stem import WordNetLemmatizer
>>> from nltk.corpus import wordnet
>>> lem = WordNetLemmatizer()
>>> lem.lemmatize('involved','v')
u'involve'
>>> lem.lemmatize('horsing','v')
u'horse'
The relevant section of the code is this:
for l in LDA_Row[0].split('+'):
    w = str(l.split('*')[1])           # extract the word from the "weight*word" term
    word = lmtzr.lemmatize(w)          # no POS given
    wordv = lmtzr.lemmatize(w, 'v')    # explicit verb POS
    print wordv, word
    # if word != wordv:
    #     print word, wordv
The whole code is here.
What is the problem?
WordNet is a large, freely and publicly available lexical database for the English language that aims to establish structured semantic relationships between words. It also offers lemmatization capabilities and is one of the earliest and most commonly used lemmatizers.
Lemmatization is the process of grouping together the different inflected forms of a word so they can be analyzed as a single item. It is similar to stemming, but it brings context to the words: inflected forms that share the same meaning are mapped to a single base form, the lemma.
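To see the difference in practice, here is a small interactive comparison (a minimal sketch using NLTK's PorterStemmer, which is not part of the original post; exact string types may vary by NLTK version): the stemmer just chops off a suffix, while the lemmatizer returns an actual dictionary form.
>>> from nltk.stem import PorterStemmer, WordNetLemmatizer
>>> PorterStemmer().stem('studies')           # stemming only strips the suffix
'studi'
>>> WordNetLemmatizer().lemmatize('studies')  # lemmatization returns a dictionary form
u'study'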
The lemmatizer requires the correct POS tag to be accurate: if you call WordNetLemmatizer.lemmatize() with the default settings, the POS tag defaults to noun, see https://github.com/nltk/nltk/blob/develop/nltk/stem/wordnet.py#L39
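You can see this with a word from your own example: with the default (noun) tag the word comes back unchanged, while the verb tag gives the expected lemma.
>>> from nltk.stem import WordNetLemmatizer
>>> wnl = WordNetLemmatizer()
>>> wnl.lemmatize('involved')        # treated as a noun, returned unchanged
'involved'
>>> wnl.lemmatize('involved', 'v')   # treated as a verb
u'involve'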
To resolve the problem, always POS-tag your data before lemmatizing, e.g.
>>> from nltk.stem import WordNetLemmatizer
>>> from nltk import pos_tag, word_tokenize
>>> wnl = WordNetLemmatizer()
>>> sent = 'This is a foo bar sentence'
>>> pos_tag(word_tokenize(sent))
[('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('foo', 'NN'), ('bar', 'NN'), ('sentence', 'NN')]
>>> for word, tag in pos_tag(word_tokenize(sent)):
...     wntag = tag[0].lower()
...     wntag = wntag if wntag in ['a', 'r', 'n', 'v'] else None
...     if not wntag:
...         lemma = word
...     else:
...         lemma = wnl.lemmatize(word, wntag)
...     print lemma
...
This
be
a
foo
bar
sentence
Note that 'is -> be', i.e.
>>> wnl.lemmatize('is')
'is'
>>> wnl.lemmatize('is', 'v')
u'be'
To answer the question with words from your examples:
>>> sent = 'These sentences involves some horsing around'
>>> for word, tag in pos_tag(word_tokenize(sent)):
...     wntag = tag[0].lower()
...     wntag = wntag if wntag in ['a', 'r', 'n', 'v'] else None
...     lemma = wnl.lemmatize(word, wntag) if wntag else word
...     print lemma
...
These
sentence
involve
some
horse
around
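If you want to use this inside a loop like the one in your question, it is convenient to wrap the tag mapping in a small helper. A minimal sketch (the names penn_to_wn and lemmatize_sentence are just illustrative, not NLTK functions):
from nltk import pos_tag, word_tokenize
from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()

def penn_to_wn(tag):
    """Map a Penn Treebank tag to a WordNet POS ('a', 'r', 'n', 'v') or None."""
    wntag = tag[0].lower()
    return wntag if wntag in ['a', 'r', 'n', 'v'] else None

def lemmatize_sentence(sent):
    """POS-tag a sentence and lemmatize each token with the mapped tag."""
    lemmas = []
    for word, tag in pos_tag(word_tokenize(sent)):
        wntag = penn_to_wn(tag)
        # Fall back to the surface form when there is no usable WordNet POS.
        lemmas.append(wnl.lemmatize(word, wntag) if wntag else word)
    return lemmas
Calling lemmatize_sentence('These sentences involves some horsing around') then gives the same tokens as the output above.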
Note that there are some quirks with the WordNetLemmatizer.
Also, NLTK's default POS tagger is undergoing some major changes to improve accuracy.
And for an out-of-the-box / off-the-shelf lemmatization solution, you can take a look at https://github.com/alvations/pywsd and how I've added some try-excepts to catch words that are not in WordNet, see https://github.com/alvations/pywsd/blob/master/pywsd/utils.py#L66
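The general pattern there is roughly the following (a sketch of the idea only, not the actual pywsd code; lemmatize_or_keep is just an illustrative name): guard the lemmatizer call and keep the original token when WordNet does not know the word.
from nltk.corpus import wordnet as wn
from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()

def lemmatize_or_keep(word, pos='n'):
    """Return the lemma when WordNet knows the word, otherwise return the
    word unchanged (names, typos and other OOV words stay as they are)."""
    try:
        lemma = wnl.lemmatize(word, pos)
    except Exception:                  # e.g. malformed / non-string input
        return word
    # wn.synsets() is empty for words WordNet has never seen.
    return lemma if wn.synsets(lemma) else word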