
Nltk french tokenizer in python not working

Why is the French tokenizer that comes with NLTK not working for me? Am I doing something wrong?

I'm doing

import nltk
content_french = ["Les astronomes amateurs jouent également un rôle important en recherche; les plus sérieux participant couramment au suivi d'étoiles variables, à la découverte de nouveaux astéroïdes et de nouvelles comètes, etc.", 'Séquence vidéo.', "John Richard Bond explique le rôle de l'astronomie."]
tokenizer = nltk.data.load('tokenizers/punkt/PY3/french.pickle')
for i in content_french:
    print(i)
    print(tokenizer.tokenize(i))

But I get non-tokenized output like:

John Richard Bond explique le rôle de l'astronomie.
["John Richard Bond explique le rôle de l'astronomie."]
Atirag asked Feb 23 '17 23:02

1 Answer

tokenizer.tokenize() is a sentence tokenizer (splitter). If you want to tokenize words, use word_tokenize() instead:

import nltk
from nltk.tokenize import word_tokenize

# word_tokenize needs the punkt models; run nltk.download('punkt') once if missing
content_french = ["Les astronomes amateurs jouent également un rôle important en recherche; les plus sérieux participant couramment au suivi d'étoiles variables, à la découverte de nouveaux astéroïdes et de nouvelles comètes, etc.", 'Séquence vidéo.', "John Richard Bond explique le rôle de l'astronomie."]
for i in content_french:
    print(i)
    print(word_tokenize(i, language='french'))

Reference

Yohanes Gultom answered Oct 05 '22 22:10