I have always used the spaCy library with English or German. To load a model I used this code:
import spacy
nlp = spacy.load('en')
I would like to use the Spanish tokeniser, but I do not know how to do it, because spaCy does not seem to have a Spanish model. I've tried this:
python -m spacy download es
and then:
nlp = spacy.load('es')
But without any success. Does anyone know how to tokenise a Spanish sentence with spaCy in the proper way?
In spaCy, tokenizing a text into segments of words and punctuation is done in several steps, processing the text from left to right. First, the tokenizer splits the text on whitespace, similar to Python's split() function. Then, for each resulting substring, it checks whether the substring matches a tokenizer exception rule or should be split further by punctuation rules.
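The steps above can be seen with a blank pipeline, which still includes the rule-based tokenizer. This sketch assumes a recent spaCy version (2.x or 3.x) where spacy.blank() is available:

```python
import spacy

# A blank pipeline has no trained components, only the rule-based tokenizer.
nlp = spacy.blank("en")

doc = nlp("Let's go to N.Y.!")
# Whitespace splitting gives "Let's", "go", "to", "N.Y.!"; exception rules
# then split "Let's" into "Let" + "'s" and keep "N.Y." as a single token,
# while punctuation rules split off the trailing "!".
print([token.text for token in doc])
```

The contraction and the abbreviation show both sides of the rules: exceptions can split a substring ("Let's") or protect it from splitting ("N.Y.").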
spaCy has grown to support over 50 languages. Both spaCy and NLTK support English, German, French, Spanish, Portuguese, Italian, Dutch, and Greek.
spaCy has support for word vectors, whereas NLTK does not. Since spaCy uses recent, well-optimized algorithms, its performance is usually good compared to NLTK: in word tokenization and POS tagging spaCy tends to perform better, but in sentence tokenization NLTK outperforms spaCy.
spaCy automatically breaks your document into tokens when the Doc is created with the model.
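That is, simply calling the pipeline on a string produces a Doc whose tokens (and their lexical attributes) are already populated. A minimal sketch, again using a blank pipeline so no model download is needed:

```python
import spacy

nlp = spacy.blank("en")  # the tokenizer runs as soon as the Doc is created
doc = nlp("spaCy tokenizes text automatically.")

# Each Token carries its text plus lexical attributes set at creation time.
for token in doc:
    print(token.i, token.text, token.is_alpha, token.is_punct)
```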
For versions up to 1.6 this code works properly:
from spacy.es import Spanish
nlp = Spanish()
but in version 1.7.2 a small change is necessary:
from spacy.es import Spanish
nlp = Spanish(path=None)
Source: @honnibal in the spaCy Gitter chat
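For reference, in more recent spaCy releases (2.x and 3.x) the language classes moved from spacy.es to spacy.lang.es, and a blank pipeline can also be created with spacy.blank(). A sketch of the modern equivalent, which gives you the Spanish tokenizer without any trained model:

```python
import spacy
from spacy.lang.es import Spanish  # spaCy 2.x/3.x location of the class

# Either line gives a Spanish tokenizer with no trained components:
nlp = Spanish()
# nlp = spacy.blank("es")  # equivalent shortcut

doc = nlp("Vamos a la playa mañana.")
print([t.text for t in doc])
```

If you later need tagging or parsing as well, a trained Spanish pipeline such as es_core_news_sm can be downloaded with `python -m spacy download es_core_news_sm` and loaded with spacy.load().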