
How to solve Spanish lemmatization problems with SpaCy?

When I try to lemmatize a Spanish CSV with more than 60,000 words, spaCy returns the wrong lemma for certain words. I understand that the model is not 100% accurate. However, I have not found any other solution, since NLTK does not provide a Spanish model.

A friend tried asking this question on the Spanish Stack Overflow; however, that community is quite small compared with this one, and we got no answers.

Code:

import spacy

nlp = spacy.load('es_core_news_sm')

def lemmatizer(text):
    # Run the pipeline and join each token's lemma back into one string
    doc = nlp(text)
    return ' '.join(word.lemma_ for word in doc)

df['column'] = df['column'].apply(lemmatizer)

To confirm that spaCy gets these wrong, I lemmatized a few of the words I had found:

text = 'personas, ideas, cosas'
# translation: persons, ideas, things

print(lemmatizer(text))
# Current output:
# personar , ideo , coser
# translation: personify, ideo, sew

# Expected output:
# persona, idea, cosa
# translation: person, idea, thing
Y4RD13 asked Mar 04 '20 21:03


2 Answers

Unlike the English lemmatizer, spaCy's Spanish lemmatizer does not use PoS information at all. It relies on a lookup list of inflected forms and their lemmas (e.g., ideo → idear, ideas → idear, idea → idear, ideamos → idear, etc.). It simply outputs the first match in the list, regardless of the word's PoS.
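To illustrate the difference, here is a minimal sketch in plain Python. It is not spaCy's actual code or data: the two tables and the function names `lookup_lemma` and `pos_lemma` are made up for this example, and the entries simply mirror the wrong outputs reported in the question.

```python
# Toy PoS-blind lookup: one lemma per surface form (illustrative entries only).
LOOKUP = {
    "personas": "personar",
    "ideas": "idear",
    "cosas": "coser",
}

# Toy PoS-aware lookup: the same surface form can map to different lemmas.
POS_LOOKUP = {
    ("personas", "NOUN"): "persona",
    ("personas", "VERB"): "personar",
    ("ideas", "NOUN"): "idea",
    ("ideas", "VERB"): "idear",
    ("cosas", "NOUN"): "cosa",
    ("cosas", "VERB"): "coser",
}


def lookup_lemma(word):
    """Return whatever the PoS-blind table says, or the word itself."""
    return LOOKUP.get(word.lower(), word)


def pos_lemma(word, pos):
    """Return the lemma for this (word, PoS) pair, or the word itself."""
    return POS_LOOKUP.get((word.lower(), pos), word)


words = ["personas", "ideas", "cosas"]
print([lookup_lemma(w) for w in words])       # verb lemmas, wrong for nouns
print([pos_lemma(w, "NOUN") for w in words])  # noun lemmas
```

With a PoS-blind table, the noun readings can never be recovered, no matter how good the tagger is.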

I actually developed spaCy's new rule-based lemmatizer for Spanish, which takes PoS and morphological information (such as tense, gender, number) into account. These fine-grained rules make it a lot more accurate than the current lookup lemmatizer. It will be released soon!

In the meantime, you could try Stanford CoreNLP or FreeLing.

Guadalupe Romero answered Sep 24 '22 07:09


One option is to make your own lemmatizer.

This might sound frightening, but fear not! It is actually very simple to build one.

I've recently written a tutorial on how to build a lemmatizer; here is the link:

https://medium.com/analytics-vidhya/how-to-build-a-lemmatizer-7aeff7a1208c

In summary, you'd have to:

  • Have a PoS tagger (you can use spaCy's tagger) to tag the input words.
  • Get a corpus of words and their lemmas. Here, I suggest you download a Universal Dependencies corpus for Spanish; just follow the steps in the tutorial mentioned above.
  • Create a lemma dict from the words extracted from the corpus.
  • Save the dict and make a wrapper function that receives both the word and its PoS.

In code, it'd look like this:

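def lemmatize(word, pos):
    # lemma_dict maps word -> {PoS: lemma}; the name avoids shadowing the built-in dict
    if word in lemma_dict:
        if pos in lemma_dict[word]:
            return lemma_dict[word][pos]
    # Fall back to the word itself when no entry matches
    return word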

Simple, right?
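To make the dict-building step concrete, here is a minimal sketch that assumes the corpus is in CoNLL-U format (the format Universal Dependencies treebanks use: ten tab-separated columns, with FORM, LEMMA and UPOS in columns 2 to 4). The helper name `build_lemma_dict` and the `SAMPLE` lines are made up for this example, and keeping the first lemma seen for each (word, PoS) pair is just one possible choice.

```python
from collections import defaultdict


def build_lemma_dict(conllu_lines):
    """Build {word: {UPOS: lemma}} from an iterable of CoNLL-U lines."""
    lemma_dict = defaultdict(dict)
    for line in conllu_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # sentence separators and comment lines
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip multiword-token ranges (1-2) and empty nodes (1.1)
        form, lemma, upos = cols[1].lower(), cols[2], cols[3]
        # Keep the first lemma seen for this (word, PoS) pair.
        lemma_dict[form].setdefault(upos, lemma)
    return dict(lemma_dict)


def lemmatize(word, pos, lemma_dict):
    """Same wrapper as above, with the dict passed in explicitly."""
    return lemma_dict.get(word.lower(), {}).get(pos, word)


# Hand-written sample in CoNLL-U layout (not taken from a real treebank).
SAMPLE = [
    "# text = Las personas tienen ideas",
    "1\tLas\tel\tDET\t_\t_\t2\tdet\t_\t_",
    "2\tpersonas\tpersona\tNOUN\t_\t_\t3\tnsubj\t_\t_",
    "3\ttienen\ttener\tVERB\t_\t_\t0\troot\t_\t_",
    "4\tideas\tidea\tNOUN\t_\t_\t3\tobj\t_\t_",
    "",
]

lemma_dict = build_lemma_dict(SAMPLE)
print(lemmatize("personas", "NOUN", lemma_dict))
print(lemmatize("ideas", "NOUN", lemma_dict))
```

For a real treebank, you would pass an open file object instead of `SAMPLE`, since iterating over a file yields its lines.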

In fact, simple lemmatization doesn't require as much processing as one might think. The hard part lies in PoS tagging, but you get that for free. Either way, if you want to do the tagging yourself, you can see this other tutorial I made:

https://medium.com/analytics-vidhya/part-of-speech-tagging-what-when-why-and-how-9d250e634df6
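If you do rely on spaCy's tagger, wiring it to the lemma dict could look roughly like the sketch below. It only assumes that each token exposes `.text` and `.pos_`, as spaCy tokens do. `LEMMA_DICT`, `lemmatize_tokens` and `FakeToken` are hypothetical names for this example, and `FakeToken` stands in for real spaCy tokens so the snippet runs without a model installed.

```python
from collections import namedtuple

# Hypothetical lemma table in the {word: {PoS: lemma}} shape described above.
LEMMA_DICT = {
    "personas": {"NOUN": "persona"},
    "ideas": {"NOUN": "idea", "VERB": "idear"},
    "cosas": {"NOUN": "cosa"},
}


def lemmatize_tokens(tokens, lemma_dict):
    """Lemmatize any iterable of objects that expose .text and .pos_."""
    lemmas = []
    for token in tokens:
        by_pos = lemma_dict.get(token.text.lower(), {})
        # Fall back to the original text when the word or its PoS is unknown.
        lemmas.append(by_pos.get(token.pos_, token.text))
    return lemmas


# Stand-in for spaCy tokens; the PoS tags here are written by hand.
FakeToken = namedtuple("FakeToken", ["text", "pos_"])
tokens = [
    FakeToken("personas", "NOUN"),
    FakeToken(",", "PUNCT"),
    FakeToken("ideas", "NOUN"),
    FakeToken(",", "PUNCT"),
    FakeToken("cosas", "NOUN"),
]
print(" ".join(lemmatize_tokens(tokens, LEMMA_DICT)))
```

With spaCy and a Spanish model installed, passing `nlp(text)` instead of the hand-made `tokens` list should work the same way, although the result then depends on the tags the model assigns.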

Hope you get it solved.

Tiago Duque answered Sep 22 '22 07:09