I am currently trying to train a text classifier using spaCy, and I'm stuck on the following question: what is the difference between creating a blank model with spacy.blank('en') and loading a pretrained model with spacy.load('en_core_web_sm')? Just to see the difference, I wrote this code:
text = "hello everyone, it's a wonderful day today"
nlp1 = spacy.load('en_core_web_sm')
for token in nlp1(text):
print(token.text, token.lemma_, token.is_stop, token.pos_)
and it gave me the following result:
hello hello False INTJ
everyone everyone True PRON
, , False PUNCT
it -PRON- True PRON
's be True AUX
a a True DET
wonderful wonderful False ADJ
day day False NOUN
today today False NOUN
Then I tried this (for the same text):
nlp2 = spacy.blank('en')
for token in nlp2(text):
    print(token.text, token.lemma_, token.is_stop, token.pos_)
and the result was
hello hello False
everyone everyone True
, , False
it -PRON- True PRON
's 's True
a a True
wonderful wonderful False
day day False
today today False
Not only are the results different (for example, the lemma for 's is different), but there are also no POS tags for most of the words in the blank model.
So obviously I need a pretrained model for normalizing my data. But I still don't understand how this fits with my text classifier. Should I (1) create a blank model for training the text classifier (using nlp.update()) and load a pretrained model for removing stop words, lemmatization, and POS tagging, or (2) load a pretrained model and use it for both normalizing the data and training my text classifier?
Thanks in advance for any advice!
If you are using spaCy's text classifier, then it is fine to start with a blank model: the TextCategorizer doesn't use features from any other pipeline components.
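For example, a minimal training loop with the spaCy v2 API could look like this (a sketch only; the labels and the two training examples are invented purely for illustration):

import random
import spacy
from spacy.util import minibatch

# A minimal sketch of option (1): train spaCy's own TextCategorizer
# on a blank model. Labels and examples here are made up.
nlp = spacy.blank("en")
textcat = nlp.create_pipe("textcat")
textcat.add_label("POSITIVE")
textcat.add_label("NEGATIVE")
nlp.add_pipe(textcat)

train_data = [
    ("it's a wonderful day today", {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}}),
    ("what an awful day", {"cats": {"POSITIVE": 0.0, "NEGATIVE": 1.0}}),
]

optimizer = nlp.begin_training()
for epoch in range(10):
    random.shuffle(train_data)
    losses = {}
    for batch in minibatch(train_data, size=2):
        texts, annotations = zip(*batch)
        nlp.update(texts, annotations, sgd=optimizer, losses=losses)
    print(epoch, losses)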
If you're using spaCy to preprocess data for another text classifier, then you need to decide which components make sense for your task. The pretrained models load a tagger, parser, and NER model by default.
The lemmatizer, which isn't implemented as a separate component, is the most complicated part of this. It tries to provide the best results with the available data and models:
If you don't have the package spacy-lookups-data installed and you create a blank model, you'll get the lowercase form as a default/dummy lemma.
If you have the package spacy-lookups-data installed and you create a blank model, it will automatically load lookup lemmas if they're available for that language (see the sketch after this list).
If you load a provided model and the pipeline includes a tagger, the lemmatizer switches to a better rule-based lemmatizer if one is available in spaCy for that language (currently: Greek, English, French, Norwegian Bokmål, Dutch, Swedish). The provided models also always include the lookup data for that language, so it can be used when the tagger isn't run.
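A quick way to check which of the blank-model cases you're in is to inspect the lemmas directly (a small sketch; the output depends on whether spacy-lookups-data is installed in your environment):

import spacy

# With spacy-lookups-data installed, a blank English model should pick up
# lookup lemmas (e.g. "be" for "was"); without it, lemma_ just falls back
# to the lowercase form of each token.
nlp = spacy.blank("en")
doc = nlp("She was running")
print([(token.text, token.lemma_) for token in doc])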
If you want to get the lookup lemmas from a provided model, you can see them by loading the model without the tagger:
import spacy
nlp = spacy.load("en_core_web_sm", disable=["tagger"])
In general, the lookup lemma quality is not great: there's no information to help with ambiguous cases, and the rule-based lemmas will be a lot better. However, it takes additional time to run the tagger, so you can choose lookup lemmas to speed things up if the quality is good enough for your task.
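To see the difference for yourself, you can run the same text through the model with and without the tagger (a sketch; the exact lemmas depend on the model version):

import spacy

# Compare rule-based lemmas (tagger enabled) with lookup lemmas
# (tagger disabled) on the same text.
nlp_rules = spacy.load("en_core_web_sm")
nlp_lookup = spacy.load("en_core_web_sm", disable=["tagger"])

text = "the ducks were swimming"
print([token.lemma_ for token in nlp_rules(text)])
print([token.lemma_ for token in nlp_lookup(text)])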
And if you're not using the parser or NER model for preprocessing, you can speed things up by disabling them:
nlp = spacy.load("en_core_web_sm", disable=["ner", "parser"])
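Putting it together, a preprocessing function for an external classifier might look like this (a hypothetical sketch; which tokens to keep is up to your task):

import spacy

# Hypothetical preprocessing for option (2) / an external classifier:
# rule-based lemmas via the tagger, with stop words and punctuation dropped.
nlp = spacy.load("en_core_web_sm", disable=["ner", "parser"])

def normalize(text):
    doc = nlp(text)
    return [token.lemma_ for token in doc if not token.is_stop and not token.is_punct]

print(normalize("hello everyone, it's a wonderful day today"))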