I'm lemmatizing the TED Transcript dataset. I've noticed something strange: not all words are being lemmatized. For example,
selected -> select
which is right. However,
involved !-> involve
and horsing !-> horse
unless I explicitly pass the 'v' (verb) argument.
In the Python terminal I get the right output, but not in my code:
>>> from nltk.stem import WordNetLemmatizer
>>> from nltk.corpus import wordnet
>>> lem = WordNetLemmatizer()
>>> lem.lemmatize('involved','v')
u'involve'
>>> lem.lemmatize('horsing','v')
u'horse'
The relevant section of the code is this:
for l in LDA_Row[0].split('+'):
    w = str(l.split('*')[1])           # extract the word from the "weight*word" term
    word = lmtzr.lemmatize(w)          # no POS given
    wordv = lmtzr.lemmatize(w, 'v')    # explicit verb POS
    print wordv, word
    # if word != wordv:
    #     print word, wordv
The whole code is here.
What is the problem?
WordNet is a large, freely and publicly available lexical database for the English language that aims to establish structured semantic relationships between words. It also offers lemmatization capabilities and is one of the earliest and most commonly used lemmatizers.
Lemmatization is the process of grouping together the different inflected forms of a word so they can be analyzed as a single item. It is similar to stemming, but it brings context to the words: inflected forms that share the same meaning are mapped to a single base form, the lemma.
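To see the difference in practice, here is a small interactive comparison (a minimal sketch using NLTK's PorterStemmer, which is not part of the original post; exact string types may vary by NLTK version): the stemmer just chops off a suffix, while the lemmatizer returns an actual dictionary form.
>>> from nltk.stem import PorterStemmer, WordNetLemmatizer
>>> PorterStemmer().stem('studies')           # stemming only strips the suffix
'studi'
>>> WordNetLemmatizer().lemmatize('studies')  # lemmatization returns a dictionary form
u'study'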
The lemmatizer requires the correct POS tag to be accurate: if you call WordNetLemmatizer.lemmatize() with the default settings, the POS tag defaults to noun, see https://github.com/nltk/nltk/blob/develop/nltk/stem/wordnet.py#L39
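You can see this with a word from your own example: with the default (noun) tag the word comes back unchanged, while the verb tag gives the expected lemma.
>>> from nltk.stem import WordNetLemmatizer
>>> wnl = WordNetLemmatizer()
>>> wnl.lemmatize('involved')        # treated as a noun, returned unchanged
'involved'
>>> wnl.lemmatize('involved', 'v')   # treated as a verb
u'involve'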
To resolve the problem, always POS-tag your data before lemmatizing, e.g.
>>> from nltk.stem import WordNetLemmatizer
>>> from nltk import pos_tag, word_tokenize
>>> wnl = WordNetLemmatizer()
>>> sent = 'This is a foo bar sentence'
>>> pos_tag(word_tokenize(sent))
[('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('foo', 'NN'), ('bar', 'NN'), ('sentence', 'NN')]
>>> for word, tag in pos_tag(word_tokenize(sent)):
...     wntag = tag[0].lower()
...     wntag = wntag if wntag in ['a', 'r', 'n', 'v'] else None
...     if not wntag:
...         lemma = word
...     else:
...         lemma = wnl.lemmatize(word, wntag)
...     print lemma
...
This
be
a
foo
bar
sentence
Note that 'is -> be', i.e.
>>> wnl.lemmatize('is')
'is'
>>> wnl.lemmatize('is', 'v')
u'be'
To answer the question with words from your examples:
>>> sent = 'These sentences involves some horsing around'
>>> for word, tag in pos_tag(word_tokenize(sent)):
...     wntag = tag[0].lower()
...     wntag = wntag if wntag in ['a', 'r', 'n', 'v'] else None
...     lemma = wnl.lemmatize(word, wntag) if wntag else word
...     print lemma
...
These
sentence
involve
some
horse
around
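If you want to use this inside a loop like the one in your question, it is convenient to wrap the tag mapping in a small helper. A minimal sketch (the names penn_to_wn and lemmatize_sentence are just illustrative, not NLTK functions):
from nltk import pos_tag, word_tokenize
from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()

def penn_to_wn(tag):
    """Map a Penn Treebank tag to a WordNet POS ('a', 'r', 'n', 'v') or None."""
    wntag = tag[0].lower()
    return wntag if wntag in ['a', 'r', 'n', 'v'] else None

def lemmatize_sentence(sent):
    """POS-tag a sentence and lemmatize each token with the mapped tag."""
    lemmas = []
    for word, tag in pos_tag(word_tokenize(sent)):
        wntag = penn_to_wn(tag)
        # Fall back to the surface form when there is no usable WordNet POS.
        lemmas.append(wnl.lemmatize(word, wntag) if wntag else word)
    return lemmas
Calling lemmatize_sentence('These sentences involves some horsing around') then gives the same tokens as the output above.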
Note that there are some quirks with the WordNetLemmatizer.
Also, NLTK's default POS tagger is undergoing some major changes to improve accuracy.
And for an out-of-the-box / off-the-shelf lemmatization solution, you can take a look at https://github.com/alvations/pywsd and how I've added some try-excepts to catch words that are not in WordNet, see https://github.com/alvations/pywsd/blob/master/pywsd/utils.py#L66
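The general pattern there is roughly the following (a sketch of the idea only, not the actual pywsd code; lemmatize_or_keep is just an illustrative name): guard the lemmatizer call and keep the original token when WordNet does not know the word.
from nltk.corpus import wordnet as wn
from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()

def lemmatize_or_keep(word, pos='n'):
    """Return the lemma when WordNet knows the word, otherwise return the
    word unchanged (names, typos and other OOV words stay as they are)."""
    try:
        lemma = wnl.lemmatize(word, pos)
    except Exception:                  # e.g. malformed / non-string input
        return word
    # wn.synsets() is empty for words WordNet has never seen.
    return lemma if wn.synsets(lemma) else word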