
Lemmatize French text [closed]

I have some text in French that I need to process in some ways. For that, I need to:

  • First, tokenize the text into words
  • Then lemmatize those words to avoid processing the same root more than once

As far as I can see, the WordNet lemmatizer in NLTK only works with English. I want something that returns "vouloir" when I give it "voudrais", and so on. I also cannot tokenize properly because of the apostrophes. Any pointers would be greatly appreciated. :)

yelsayed asked Oct 29 '12 23:10



2 Answers

The best solution I found is spaCy; it seems to do the job.

To install:

    pip3 install spacy
    python3 -m spacy download fr_core_news_md

To use:

    import spacy
    nlp = spacy.load('fr_core_news_md')

    doc = nlp(u"voudrais non animaux yeux dors couvre.")
    for token in doc:
        print(token, token.lemma_)

Result:

    voudrais vouloir
    non non
    animaux animal
    yeux oeil
    dors dor
    couvre couvrir

Check out the documentation for more details: https://spacy.io/models/fr and https://spacy.io/usage
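Since the goal in the question is to avoid processing the same root more than once, here is a minimal sketch of grouping surface forms by lemma. It uses only the standard library and takes (form, lemma) pairs like the ones printed above. The helper name `group_by_lemma` is made up for illustration, and the `("animal", "animal")` pair is an assumed example, not output from the model.

```python
from collections import defaultdict

def group_by_lemma(pairs):
    """Map each lemma to the distinct surface forms that share it."""
    groups = defaultdict(list)
    for form, lemma in pairs:
        # Record each surface form only once, in first-seen order.
        if form not in groups[lemma]:
            groups[lemma].append(form)
    return dict(groups)

# Pairs as reported in the result above, plus one assumed pair
# ("animal", "animal") to show two forms sharing a lemma.
pairs = [
    ("voudrais", "vouloir"),
    ("animaux", "animal"),
    ("animal", "animal"),
    ("voudrais", "vouloir"),
]
print(group_by_lemma(pairs))
```

With spaCy, the pairs could be built as `[(t.text, t.lemma_) for t in doc]`, so each lemma only needs to be processed once.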

karimsaieh answered Sep 26 '22 23:09


Here's an old but relevant comment by an NLTK dev. It looks like most of the advanced stemmers in NLTK are English-specific:

The nltk.stem module currently contains 3 stemmers: the Porter stemmer, the Lancaster stemmer, and a Regular-Expression based stemmer. The Porter stemmer and Lancaster stemmer are both English-specific. The regular-expression based stemmer can be customized to use any regular expression you wish. So you should be able to write a simple stemmer for non-English languages using the regexp stemmer. For example, for French:

    from nltk import stem
    stemmer = stem.Regexp('s$|es$|era$|erez$|ions$| <etc> ')

But you'd need to come up with the language-specific regular expression yourself. For a more advanced stemmer, it would probably be necessary to add a new module. (This might be a good student project.)

For more information on the regexp stemmer:

http://nltk.org/doc/api/nltk.stem.regexp.Regexp-class.html

-Edward

Note: the link he gives is dead; see the current RegexpStemmer documentation instead.
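To make the regexp-stemmer idea from the quote concrete, here is a minimal sketch that uses only Python's `re` module rather than NLTK itself. The suffix list is the partial one from the quote (without the elided rest), and the `MIN_LENGTH` cutoff and the function name `strip_suffix` are my own assumptions. As far as I know, the equivalent class in current NLTK is `nltk.stem.RegexpStemmer`.

```python
import re

# Partial, illustrative suffix list taken from the quote above.
# A real French stemmer would need many more rules.
SUFFIXES = re.compile(r"(?:erez|ions|era|es|s)$")

# Assumed cutoff: leave very short words such as "les" untouched.
MIN_LENGTH = 4

def strip_suffix(word):
    """Remove one matching suffix from the end of the word, if any."""
    if len(word) < MIN_LENGTH:
        return word
    return SUFFIXES.sub("", word, count=1)

for word in ["parlerez", "chantions", "tables", "les", "animaux"]:
    print(word, strip_suffix(word))
```

This only looks at word endings, so irregular forms like "animaux" or "yeux" pass through unchanged, which is exactly the limitation the quote warns about.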

The more recently added snowball stemmer appears to be able to stem French though. Let's put it to the test:

    >>> from nltk.stem.snowball import FrenchStemmer
    >>> stemmer = FrenchStemmer()
    >>> stemmer.stem('voudrais')
    u'voudr'
    >>> stemmer.stem('animaux')
    u'animal'
    >>> stemmer.stem('yeux')
    u'yeux'
    >>> stemmer.stem('dors')
    u'dor'
    >>> stemmer.stem('couvre')
    u'couvr'

As you can see, some results are a bit dubious.

Not quite what you were hoping for, but I guess it's a start.
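As for the apostrophe problem mentioned in the question, a rough sketch of a regex tokenizer that splits elided forms such as "l'" and "qu'" from the following word might look like this. It uses only the standard library. The pattern and the name `tokenize` are assumptions of mine, not part of NLTK, and it will wrongly split fixed words like "aujourd'hui".

```python
import re

# Three alternatives, tried in order:
#   1. an elided form ending in a straight or curly apostrophe (l', qu', j')
#   2. a word, optionally hyphenated (peut-être)
#   3. any single punctuation character
TOKEN_RE = re.compile(r"\w+['’]|\w+(?:-\w+)*|[^\w\s]")

def tokenize(text):
    """Split French text into tokens, keeping elisions as separate tokens."""
    return TOKEN_RE.findall(text)

print(tokenize("J'ai vu l'homme."))
```

The resulting tokens can then be passed one by one to a stemmer such as the FrenchStemmer above.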

Junuxx answered Sep 24 '22 23:09