
Is spaCy lemmatization not working properly, or does it just not lemmatize words ending in "-ing"?

Tags: python, nlp, spacy

When I run the spaCy lemmatizer, it does not lemmatize the word "consulting", so I suspect it is failing.

Here is my code:

import spacy

nlp = spacy.load('en_core_web_trf', disable=['parser', 'ner'])
lemmatizer = nlp.get_pipe('lemmatizer')
doc = nlp('consulting')
print([token.lemma_ for token in doc])

And my output:

['consulting']
asked Sep 07 '25 by M_Neelakandan

2 Answers

The spaCy lemmatizer is not failing; it's performing as expected. Lemmatization depends heavily on the part-of-speech (PoS) tag assigned to the token, and PoS tagger models are trained on sentences/documents, not single tokens (words). For example, parts-of-speech.info, which is based on the Stanford PoS tagger, does not allow you to enter single words.

In your case, the single word "consulting" is being tagged as a noun, and the spaCy model you are using deems "consulting" to be the appropriate lemma for that case. If you change your string to "consulting tomorrow", spaCy will lemmatize "consulting" to "consult", as it is then tagged as a verb (see the output from the code below). In short, I recommend not performing lemmatization on single tokens; instead, use the model on sentences/documents, as it was intended.

As a side note: make sure you understand the difference between a lemma and a stem. If you are unsure, read the relevant section of the Wikipedia Lemma (morphology) page.
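To make that distinction concrete, here is a toy sketch (hypothetical helper functions and a made-up lookup table, not any real library): a stemmer chops suffixes by rule and can emit non-words, while a lemmatizer maps a word to a dictionary headword keyed on its part of speech.

```python
def naive_stem(word):
    # Crude rule-based suffix stripping, as a stemmer might do.
    for suffix in ("ing", "ies", "es", "s"):
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word


# Tiny illustrative lemma dictionary keyed on (word, PoS tag).
LEMMA_TABLE = {
    ("studies", "NOUN"): "study",
    ("studies", "VERB"): "study",
    ("consulting", "VERB"): "consult",
    ("consulting", "NOUN"): "consulting",
    ("better", "ADJ"): "good",
}


def toy_lemmatize(word, pos):
    # Dictionary lookup; falls back to the word itself if unknown.
    return LEMMA_TABLE.get((word, pos), word)


print(naive_stem("studies"))             # "stud" - not a real word
print(toy_lemmatize("studies", "NOUN"))  # "study" - a valid headword
print(toy_lemmatize("better", "ADJ"))    # "good" - no suffix rule could produce this
```

Note how the lemma can depend on the PoS tag ("consulting" as a verb vs. a noun), which is exactly why spaCy's output differs between your single-token input and a full sentence.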

import spacy
nlp = spacy.load('en_core_web_trf', disable=['parser', 'ner'])
doc = nlp('consulting')
print([[token.pos_, token.lemma_] for token in doc])
# Output: [['NOUN', 'consulting']]
doc_verb = nlp('Consulting tomorrow')
print([[token.pos_, token.lemma_] for token in doc_verb])
# Output: [['VERB', 'consult'], ['NOUN', 'tomorrow']]

If you really need to lemmatize single words, the second approach in this GeeksforGeeks Python lemmatization tutorial produces the lemma "consult". I've included a condensed version of it here for future reference in case the link goes dead. I haven't tested it on other single tokens (words), so it may not work in all cases.

# Condensed version of approach #2 given in the GeeksforGeeks lemmatizer tutorial:
# https://www.geeksforgeeks.org/python-lemmatization-approaches-with-examples/
import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

# Download the required NLTK data (tokenizer, tagger, and WordNet).
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')


# Map Penn Treebank PoS tags to WordNet PoS tags.
def pos_tagger(nltk_tag):
    if nltk_tag.startswith('J'):
        return wordnet.ADJ
    elif nltk_tag.startswith('V'):
        return wordnet.VERB
    elif nltk_tag.startswith('N'):
        return wordnet.NOUN
    elif nltk_tag.startswith('R'):
        return wordnet.ADV
    else:
        return None


lemmatizer = WordNetLemmatizer()
sentence = 'consulting'
pos_tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
lemmatized_sentence = []
for word, tag in pos_tagged:
    wn_tag = pos_tagger(tag)
    if wn_tag is None:
        # No WordNet equivalent for this tag; keep the token unchanged.
        lemmatized_sentence.append(word)
    else:
        lemmatized_sentence.append(lemmatizer.lemmatize(word, wn_tag))
print(lemmatized_sentence)
# Output: ['consult']
answered Sep 10 '25 by Kyle F. Hartzenberg

spaCy's lemmatizer behaves differently depending on the part of speech. In particular, for nouns, the "-ing" form is considered to be the lemma already, and is not changed.

Here's an example that illustrates the difference:

import spacy

nlp = spacy.load("en_core_web_sm")

text = "While consulting, I sometimes tell people about the consulting business."
for tok in nlp(text):
    print(tok, tok.pos_, tok.lemma_, sep="\t")

Output:

While   SCONJ   while
consulting      VERB    consult
,       PUNCT   ,
I       PRON    I
sometimes       ADV     sometimes
tell    VERB    tell
people  NOUN    people
about   ADP     about
the     DET     the
consulting      NOUN    consulting
business        NOUN    business

Note how the verb occurrence has "consult" as its lemma, while the noun occurrence keeps "consulting".

answered Sep 10 '25 by polm23