I am currently trying to train a text classifier using spaCy, and I'm stuck on the following question: what is the difference between creating a blank model with spacy.blank('en') and loading a pretrained model with spacy.load('en_core_web_sm')? Just to see the difference, I wrote this code:
text = "hello everyone, it's a wonderful day today"
nlp1 = spacy.load('en_core_web_sm')
for token in nlp1(text):
print(token.text, token.lemma_, token.is_stop, token.pos_)
and it gave me the following result:
hello hello False INTJ
everyone everyone True PRON
, , False PUNCT
it -PRON- True PRON
's be True AUX
a a True DET
wonderful wonderful False ADJ
day day False NOUN
today today False NOUN
Then I tried this (for the same text):
nlp2 = spacy.blank('en')
for token in nlp2(text):
    print(token.text, token.lemma_, token.is_stop, token.pos_)
and the result was
hello hello False
everyone everyone True
, , False
it -PRON- True PRON
's 's True
a a True
wonderful wonderful False
day day False
today today False
Not only are the results different (for example, the lemma for 's is different), but there are also no POS tags for most of the words in the blank model.
So obviously I need a pretrained model for normalizing my data. But I still don't understand how this fits with my text classifier. Should I (1) create a blank model for training the text classifier (using nlp.update()) and load a pretrained model for removing stop words, lemmatization, and POS tagging, or (2) load a pretrained model and use it for both normalizing the data and training my text classifier?
Thanks in advance for any advice!
If you are using spaCy's text classifier, then it is fine to start with a blank model: the TextCategorizer doesn't use features from any other pipeline components.
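For example, a minimal training loop with the spaCy v2 API could look like this (a sketch only; the labels and the two training examples are invented purely for illustration):

import random
import spacy
from spacy.util import minibatch

# A minimal sketch of option (1): train spaCy's own TextCategorizer
# on a blank model. Labels and examples here are made up.
nlp = spacy.blank("en")
textcat = nlp.create_pipe("textcat")
textcat.add_label("POSITIVE")
textcat.add_label("NEGATIVE")
nlp.add_pipe(textcat)

train_data = [
    ("it's a wonderful day today", {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}}),
    ("what an awful day", {"cats": {"POSITIVE": 0.0, "NEGATIVE": 1.0}}),
]

optimizer = nlp.begin_training()
for epoch in range(10):
    random.shuffle(train_data)
    losses = {}
    for batch in minibatch(train_data, size=2):
        texts, annotations = zip(*batch)
        nlp.update(texts, annotations, sgd=optimizer, losses=losses)
    print(epoch, losses)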
If you're using spaCy to preprocess data for another text classifier, then you need to decide which components make sense for your task. The pretrained models load a tagger, parser, and NER model by default.
The lemmatizer, which isn't implemented as a separate component, is the most complicated part of this. It tries to provide the best results with the available data and models:
If you don't have the package spacy-lookups-data installed and you create a blank model, you'll get the lowercase form as a default/dummy lemma.
If you have the package spacy-lookups-data installed and you create a blank model, it will automatically load lookup lemmas if they're available for that language (see the sketch after this list).
If you load a provided model and the pipeline includes a tagger, the lemmatizer switches to a better rule-based lemmatizer if one is available in spaCy for that language (currently: Greek, English, French, Norwegian Bokmål, Dutch, Swedish). The provided models also always include the lookup data for that language, so it can be used when the tagger isn't run.
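A quick way to check which of the blank-model cases you're in is to inspect the lemmas directly (a small sketch; the output depends on whether spacy-lookups-data is installed in your environment):

import spacy

# With spacy-lookups-data installed, a blank English model should pick up
# lookup lemmas (e.g. "be" for "was"); without it, lemma_ just falls back
# to the lowercase form of each token.
nlp = spacy.blank("en")
doc = nlp("She was running")
print([(token.text, token.lemma_) for token in doc])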
If you want to get the lookup lemmas from a provided model, you can see them by loading the model without the tagger:
import spacy
nlp = spacy.load("en_core_web_sm", disable=["tagger"])
In general, the lookup lemma quality is not great: there's no information to help with ambiguous cases, and the rule-based lemmas will be a lot better. However, it takes additional time to run the tagger, so you can choose lookup lemmas to speed things up if the quality is good enough for your task.
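To see the difference for yourself, you can run the same text through the model with and without the tagger (a sketch; the exact lemmas depend on the model version):

import spacy

# Compare rule-based lemmas (tagger enabled) with lookup lemmas
# (tagger disabled) on the same text.
nlp_rules = spacy.load("en_core_web_sm")
nlp_lookup = spacy.load("en_core_web_sm", disable=["tagger"])

text = "the ducks were swimming"
print([token.lemma_ for token in nlp_rules(text)])
print([token.lemma_ for token in nlp_lookup(text)])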
And if you're not using the parser or NER model for preprocessing, you can speed things up by disabling them:
nlp = spacy.load("en_core_web_sm", disable=["ner", "parser"])
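Putting it together, a preprocessing function for an external classifier might look like this (a hypothetical sketch; which tokens to keep is up to your task):

import spacy

# Hypothetical preprocessing for option (2) / an external classifier:
# rule-based lemmas via the tagger, with stop words and punctuation dropped.
nlp = spacy.load("en_core_web_sm", disable=["ner", "parser"])

def normalize(text):
    doc = nlp(text)
    return [token.lemma_ for token in doc if not token.is_stop and not token.is_punct]

print(normalize("hello everyone, it's a wonderful day today"))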