Lemmatize French text [closed]

Tags: python, nltk, lemmatization

I have some French text that I need to process. To do that, I need to:

1. Tokenize the text into words.
2. Lemmatize those words, so that I don't process the same root more than once.

As far as I can see, the WordNet lemmatizer in NLTK only works with English. I want something that returns "vouloir" when I give it "voudrais", and so on. I also cannot tokenize properly because of the apostrophes. Any pointers would be greatly appreciated. :)

asked Oct 29 '12 23:10

yelsayed

2 Answers

The best solution I found is spaCy; it seems to do the job.

To install:

    pip3 install spacy
    python3 -m spacy download fr_core_news_md

To use:

    import spacy

    nlp = spacy.load('fr_core_news_md')

    doc = nlp(u"voudrais non animaux yeux dors couvre.")
    for token in doc:
        print(token, token.lemma_)

Result:

    voudrais vouloir
    non non
    animaux animal
    yeux oeil
    dors dor
    couvre couvrir

Check out the documentation for more details: https://spacy.io/models/fr and https://spacy.io/usage
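The question also mentions trouble with apostrophes. spaCy tokenizes the text as part of the nlp(...) call, so the snippet above needs no separate step. If you only want a quick split without any dependency, the following is a rough sketch using the standard re module. The pattern, the KEEP_WHOLE list, and the tokenize_fr name are my own illustrative choices, not part of spaCy or NLTK, and they will not cover every French edge case (inverted forms such as "a-t-il" are kept as one token, for instance).

```python
import re

# Words that contain an apostrophe but should stay whole.
# Illustrative only; a real list would be much longer.
KEEP_WHOLE = ["aujourd'hui"]

TOKEN_RE = re.compile(
    "|".join(re.escape(w) for w in KEEP_WHOLE)  # fixed expressions are tried first
    + r"|\w+['’]"        # elided clitic kept with its apostrophe: l', j', qu'
    + r"|\w+(?:-\w+)*"   # plain or hyphenated word: peut-être
    + r"|[^\w\s]",       # any single punctuation mark
    re.IGNORECASE,
)


def tokenize_fr(text):
    """Return the tokens of `text` in order, dropping whitespace."""
    return TOKEN_RE.findall(text)


print(tokenize_fr("J'aimerais qu'il vienne aujourd'hui, peut-être."))
```

Keeping the apostrophe attached to the clitic ("qu'" rather than "qu") makes it easy to map the token back to its full form later, or to filter it out as a stop word.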

answered Sep 26 '22 23:09

karimsaieh

Here's an old but relevant comment by an NLTK developer. It looks like most of the advanced stemmers in NLTK are English-specific:

The nltk.stem module currently contains 3 stemmers: the Porter stemmer, the Lancaster stemmer, and a Regular-Expression based stemmer. The Porter stemmer and Lancaster stemmer are both English-specific. The regular-expression based stemmer can be customized to use any regular expression you wish. So you should be able to write a simple stemmer for non-English languages using the regexp stemmer. For example, for french:

    from nltk import stem
    stemmer = stem.Regexp('s$|es$|era$|erez$|ions$| <etc> ')

But you'd need to come up with the language-specific regular expression yourself. For a more advanced stemmer, it would probably be necessary to add a new module. (This might be a good student project.)

For more information on the regexp stemmer:

http://nltk.org/doc/api/nltk.stem.regexp.Regexp-class.html

-Edward

Note: the link he gives is dead; see the current NLTK documentation for the regexp stemmer instead. In current NLTK releases the class is called RegexpStemmer (in nltk.stem.regexp).
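To make the idea in the quoted comment concrete, here is a minimal sketch of a suffix-stripping stemmer built on Python's re module rather than on NLTK, so it runs without any install. The suffix list is just the fragment from the comment, and regex_stem and min_length are hypothetical names chosen for this illustration. It is nowhere near a usable French stemmer.

```python
import re

# Suffix fragment taken from the quoted comment; illustrative, far from complete.
FRENCH_SUFFIXES = re.compile(r"(?:ions|erez|era|es|s)$")


def regex_stem(word, min_length=4):
    """Strip one matching suffix from `word`; leave short words untouched."""
    if len(word) < min_length:
        return word
    return FRENCH_SUFFIXES.sub("", word, count=1)


for w in ["chantions", "parlerez", "parlera", "portes", "animaux", "les"]:
    print(w, "->", regex_stem(w))
```

Irregular forms such as "animaux" or "voudrais" pass through unchanged or get cut in the wrong place, which is why a dictionary-backed lemmatizer is usually the better fit for the question as asked.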

The more recently added Snowball stemmer does appear to support French, though. Let's put it to the test:

    >>> from nltk.stem.snowball import FrenchStemmer
    >>> stemmer = FrenchStemmer()
    >>> stemmer.stem('voudrais')
    u'voudr'
    >>> stemmer.stem('animaux')
    u'animal'
    >>> stemmer.stem('yeux')
    u'yeux'
    >>> stemmer.stem('dors')
    u'dor'
    >>> stemmer.stem('couvre')
    u'couvr'

As you can see, some results are a bit dubious.

Not quite what you were hoping for, but I guess it's a start.

answered Sep 24 '22 23:09

Junuxx
