 

Use spaCy Spanish Tokenizer

I have always used the spaCy library with English or German.

To load the library I used this code:

import spacy
nlp = spacy.load('en')

I would like to use the Spanish tokeniser, but I do not know how to do it, because spaCy does not have a Spanish model. I've tried this:

python -m spacy download es

and then:

nlp = spacy.load('es')

But obviously without any success.

Does someone know how to tokenise a Spanish sentence with spaCy in the proper way?

Luca Ambrosini asked Mar 22 '17 09:03

People also ask

How do you use a spaCy tokenizer?

In spaCy, the process of tokenizing a text into segments of words and punctuation is done in several steps. It processes the text from left to right. First, the tokenizer splits the text on whitespace, similar to the split() function. Then the tokenizer checks whether each substring matches a tokenizer exception rule.
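For illustration, a minimal sketch of that behaviour (assuming a current spaCy install with the en_core_web_sm English model, which is not part of the original question):

import spacy

# Load a small English model (assumed to be installed via
# `python -m spacy download en_core_web_sm`).
nlp = spacy.load('en_core_web_sm')

# The tokenizer first splits on whitespace, then applies exception rules,
# so "Let's" becomes two tokens while "N.Y." stays a single token.
doc = nlp("Let's go to N.Y.!")
print([token.text for token in doc])
# ['Let', "'s", 'go', 'to', 'N.Y.', '!']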

How many languages does spaCy support?

Since then, spaCy has grown to support over 50 languages. Both spaCy and NLTK support English, German, French, Spanish, Portuguese, Italian, Dutch, and Greek.

Which is better NLTK or spaCy?

spaCy has support for word vectors, whereas NLTK does not. Because spaCy uses recent algorithms, its performance is usually better than NLTK's. In word tokenization and POS tagging, spaCy performs better, but in sentence tokenization, NLTK outperforms spaCy.
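A rough comparison sketch of word tokenization in both libraries (assuming both are installed, the en_core_web_sm model is available, and NLTK's punkt data has been downloaded; the sample sentence is only illustrative):

import nltk
import spacy

# NLTK's word_tokenize needs the 'punkt' tokenizer data
# (recent NLTK versions may fetch 'punkt_tab' instead).
nltk.download('punkt', quiet=True)
nlp = spacy.load('en_core_web_sm')

text = "Dr. Smith isn't here."

# NLTK word tokenization
print(nltk.word_tokenize(text))

# spaCy word tokenization
print([token.text for token in nlp(text)])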

Does spaCy automatically Tokenize?

spaCy automatically breaks your document into tokens when a Doc is created by calling the model on the text.
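A small sketch of that (again assuming the en_core_web_sm model is installed): the returned Doc already contains the tokens.

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp("Tokens are created as soon as the Doc is built.")

print(len(doc))      # number of tokens
print(doc[0].text)   # first token: 'Tokens'
for token in doc:
    print(token.text)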


1 Answer

For versions up to 1.6, this code works properly:

from spacy.es import Spanish
nlp = Spanish()

but in version 1.7.2 a little change is necessary:

from spacy.es import Spanish
nlp = Spanish(path=None)

Source: @honnibal in the Gitter chat
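A usage sketch based on the answer above (spaCy 1.7.x API; the sample sentence and the assumption that calling the pipeline without a trained model still tokenises are mine):

from spacy.es import Spanish

nlp = Spanish(path=None)

# Tokenise a Spanish sentence; only the tokenizer is needed here.
doc = nlp("Hola, ¿cómo estás?")
print([token.text for token in doc])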

Luca Ambrosini answered Nov 12 '22 20:11