When I run the spacy lemmatizer, it does not lemmatize the word "consulting" and therefore I suspect it is failing.
Here is my code:
import spacy
nlp = spacy.load('en_core_web_trf', disable=['parser', 'ner'])
lemmatizer = nlp.get_pipe('lemmatizer')
doc = nlp('consulting')
print([token.lemma_ for token in doc])
And my output:
['consulting']
The spaCy lemmatizer is not failing; it is performing as expected. Lemmatization depends heavily on the part-of-speech (PoS) tag assigned to the token, and PoS tagger models are trained on sentences and documents, not on single tokens (words). For example, parts-of-speech.info, which is based on the Stanford PoS tagger, does not allow you to enter single words.
In your case, the single word "consulting" is being tagged as a noun, and the spaCy model you are using deems "consulting" to be the appropriate lemma for a noun. If you change your string to "Consulting tomorrow", spaCy lemmatizes "consulting" to "consult" because it is now tagged as a verb (see the output from the code below). In short, I recommend not lemmatizing single tokens; instead, use the model on sentences or documents, as it was intended.
As a side note: make sure you understand the difference between a lemma and a stem. Read the relevant section of the Wikipedia "Lemma (morphology)" page if you are unsure.
import spacy
nlp = spacy.load('en_core_web_trf', disable=['parser', 'ner'])
doc = nlp('consulting')
print([[token.pos_, token.lemma_] for token in doc])
# Output: [['NOUN', 'consulting']]
doc_verb = nlp('Consulting tomorrow')
print([[token.pos_, token.lemma_] for token in doc_verb])
# Output: [['VERB', 'consult'], ['NOUN', 'tomorrow']]
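To make the lemma/stem distinction mentioned above a little more concrete, here is a deliberately tiny sketch in plain Python. It is not how spaCy or NLTK work internally: the suffix list, the lookup table, and the names `toy_stem` and `toy_lemma` are all made up for illustration. The point is only that a stem is obtained by stripping affixes regardless of context, whereas a lemma is a dictionary form that depends on the part of speech.

```python
# Toy illustration only: the suffix list and lemma table are invented for this
# example and are far smaller than anything a real stemmer or lemmatizer uses.
SUFFIXES = ("ing", "ed", "s")

# Hypothetical (word, PoS) -> lemma table.
LEMMA_TABLE = {
    ("consulting", "VERB"): "consult",
    ("consulting", "NOUN"): "consulting",
    ("ran", "VERB"): "run",
}


def toy_stem(word):
    """Strip the first matching suffix; the part of speech is ignored."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def toy_lemma(word, pos):
    """Look up the dictionary form for a (word, PoS) pair; fall back to the word."""
    return LEMMA_TABLE.get((word, pos), word)


for word, pos in [("consulting", "VERB"), ("consulting", "NOUN"), ("ran", "VERB")]:
    print(word, pos, toy_stem(word), toy_lemma(word, pos), sep="\t")
```

In this toy, the stem of "consulting" is always "consult", while the lemma is "consult" only when the word is tagged as a verb. That mirrors the behaviour you are seeing from spaCy.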
If you really need to lemmatize single words, the second approach in this GeeksforGeeks Python lemmatization tutorial produces the lemma "consult". I've created a condensed version of it here for future reference in case the link becomes invalid. I haven't tested it on other single tokens (words), so it may not work for all cases.
# Condensed version of approach #2 given in the GeeksforGeeks lemmatizer tutorial:
# https://www.geeksforgeeks.org/python-lemmatization-approaches-with-examples/
import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

# Data used below: the PoS tagger, the tokenizer model, and the WordNet corpus.
# Resource names may differ between NLTK versions.
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')
nltk.download('wordnet')

# POS_TAGGER_FUNCTION : TYPE 1
def pos_tagger(nltk_tag):
    # Map a Penn Treebank tag to the matching WordNet PoS constant
    if nltk_tag.startswith('J'):
        return wordnet.ADJ
    elif nltk_tag.startswith('V'):
        return wordnet.VERB
    elif nltk_tag.startswith('N'):
        return wordnet.NOUN
    elif nltk_tag.startswith('R'):
        return wordnet.ADV
    else:
        return None

lemmatizer = WordNetLemmatizer()
sentence = 'consulting'
pos_tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
lemmatized_sentence = []
for word, tag in pos_tagged:
    wn_pos = pos_tagger(tag)
    # Keep the word unchanged when its tag has no WordNet equivalent
    lemmatized_sentence.append(lemmatizer.lemmatize(word, wn_pos) if wn_pos else word)
print(lemmatized_sentence)
# Output: ['consult']
spaCy's lemmatizer behaves differently depending on the part of speech. In particular, for nouns, the "-ing" form is considered to be the lemma already, and is not changed.
Here's an example that illustrates the difference:
import spacy
nlp = spacy.load("en_core_web_sm")
text = "While consulting, I sometimes tell people about the consulting business."
for tok in nlp(text):
    print(tok, tok.pos_, tok.lemma_, sep="\t")
Output:
While       SCONJ   while
consulting  VERB    consult
,           PUNCT   ,
I           PRON    I
sometimes   ADV     sometimes
tell        VERB    tell
people      NOUN    people
about       ADP     about
the         DET     the
consulting  NOUN    consulting
business    NOUN    business
See how the verb has "consult" as a lemma, while the noun does not.
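If you want to spot such cases programmatically, something along these lines might help. It is only a sketch: the rows are copied by hand from the output above rather than produced by running spaCy, and the helper names `lemmas_by_form` and `ambiguous_forms` are my own, not part of any library.

```python
from collections import defaultdict

# (text, pos, lemma) rows copied by hand from the output shown above.
rows = [
    ("While", "SCONJ", "while"),
    ("consulting", "VERB", "consult"),
    (",", "PUNCT", ","),
    ("I", "PRON", "I"),
    ("sometimes", "ADV", "sometimes"),
    ("tell", "VERB", "tell"),
    ("people", "NOUN", "people"),
    ("about", "ADP", "about"),
    ("the", "DET", "the"),
    ("consulting", "NOUN", "consulting"),
    ("business", "NOUN", "business"),
]


def lemmas_by_form(rows):
    """Map each lowercased surface form to the set of (pos, lemma) pairs seen."""
    seen = defaultdict(set)
    for text, pos, lemma in rows:
        seen[text.lower()].add((pos, lemma))
    return seen


def ambiguous_forms(rows):
    """Return the forms that received more than one distinct lemma."""
    return {
        form: sorted(pairs)
        for form, pairs in lemmas_by_form(rows).items()
        if len({lemma for _, lemma in pairs}) > 1
    }


print(ambiguous_forms(rows))
# {'consulting': [('NOUN', 'consulting'), ('VERB', 'consult')]}
```

With a real pipeline you could build the rows from `(tok.text, tok.pos_, tok.lemma_)` for each token in a `Doc` instead of typing them in.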