When I try to lemmatize a CSV with more than 60,000 Spanish words, spaCy returns the wrong lemma for certain words. I understand that the model is not 100% accurate. However, I have not found any other solution, since NLTK does not ship a Spanish model.
A friend asked this question on the Spanish Stack Overflow, but that community is quite small compared with this one, and we got no answers.
Code:
import spacy

nlp = spacy.load('es_core_news_sm')

def lemmatizer(text):
    # Run the Spanish pipeline and join the lemma of each token
    doc = nlp(text)
    return ' '.join([word.lemma_ for word in doc])

df['column'] = df['column'].apply(lemmatizer)
I then lemmatized a few of the words I had found to be wrong, to show that spaCy is not handling them correctly:
text = 'personas, ideas, cosas'
# translation: persons, ideas, things
print(lemmatizer(text))
# Current output:
personar , ideo , coser
# translation:
personify, ideo, sew
# The expected output should be:
persona, idea, cosa
# translation:
person, idea, thing
Unlike the English lemmatizer, spaCy's Spanish lemmatizer does not use PoS information at all. It relies on a lookup list of inflected verbs and lemmas (e.g., ideo idear, ideas idear, idea idear, ideamos idear, etc.). It will just output the first match in the list, regardless of its PoS.
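To make the failure mode concrete, here is a minimal sketch of a PoS-blind lookup lemmatizer. The table entries and the names `PAIRS`, `LOOKUP`, and `lookup_lemmatize` are made up for illustration; this is not spaCy's actual data or code, only the "first match wins" idea described above.

```python
# Hypothetical (form, lemma) pairs; NOT spaCy's real lookup data.
# The verb reading happens to come first for each ambiguous form.
PAIRS = [
    ("ideas", "idear"),      # verb reading: "tú ideas"
    ("ideas", "idea"),       # noun reading, never reached
    ("cosas", "coser"),      # verb reading: subjunctive of "coser"
    ("cosas", "cosa"),       # noun reading, never reached
]

# Keep only the first lemma seen for each form, ignoring PoS entirely.
LOOKUP = {}
for form, lemma in PAIRS:
    LOOKUP.setdefault(form, lemma)

def lookup_lemmatize(word):
    """Return the first listed lemma, or the word itself if unknown."""
    return LOOKUP.get(word.lower(), word)

lemmas = [lookup_lemmatize(w) for w in ["ideas", "cosas", "gato"]]
print(lemmas)
```

Because the table has no notion of PoS, the noun "ideas" gets the verb lemma whenever the verb entry is listed first.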
I actually developed spaCy's new rule-based lemmatizer for Spanish, which takes PoS and morphological information (such as tense, gender, number) into account. These fine-grained rules make it a lot more accurate than the current lookup lemmatizer. It will be released soon!
In the meantime, you could try Stanford CoreNLP or FreeLing.
One option is to make your own lemmatizer.
This might sound frightening, but fear not! It is actually very simple to build one.
I recently wrote a tutorial on how to make a lemmatizer. The link is here:
https://medium.com/analytics-vidhya/how-to-build-a-lemmatizer-7aeff7a1208c
In summary, you'd need a dictionary that maps each word and PoS tag to its lemma, plus a function that looks words up in it. In code, it'd look like this:
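def lemmatize(word, pos):
    # lemma_dict maps each word form to a {PoS tag: lemma} dictionary
    if word in lemma_dict:
        if pos in lemma_dict[word]:
            return lemma_dict[word][pos]
    # No entry for this word and PoS: fall back to the word itself
    return word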
Simple, right?
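As a rough, self-contained sketch of the same idea, here it is with a tiny dictionary filled in. The entries in `LEMMAS` are hand-written for illustration, and the PoS tags are hard-coded here; in practice they would come from a tagger (for example, spaCy's `token.pos_`).

```python
# Hypothetical dictionary: word form -> {PoS tag: lemma}.
# Entries are hand-written for illustration only.
LEMMAS = {
    "personas": {"NOUN": "persona", "VERB": "personar"},
    "ideas": {"NOUN": "idea", "VERB": "idear"},
    "cosas": {"NOUN": "cosa", "VERB": "coser"},
}

def lemmatize(word, pos):
    """Look up the lemma for (word, pos); fall back to the word itself."""
    return LEMMAS.get(word.lower(), {}).get(pos, word)

# Tags are assumed to come from a PoS tagger; here they are hard-coded.
tagged = [("personas", "NOUN"), ("ideas", "NOUN"), ("cosas", "NOUN")]
result = [lemmatize(word, pos) for word, pos in tagged]
print(", ".join(result))
```

With noun tags, the three words from the question resolve to their noun lemmas, while the same forms tagged as `VERB` would resolve to the verb lemmas.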
In fact, simple lemmatization doesn't require as much processing as one would think. The hard part lies in PoS tagging, but you already get that for free from spaCy. Either way, if you want to do the tagging yourself, you can see this other tutorial I made:
https://medium.com/analytics-vidhya/part-of-speech-tagging-what-when-why-and-how-9d250e634df6
Hope you get it solved.