
spaCy 2.0: en_vectors_web_lg vs en_core_web_lg

Tags:

spacy

What is the difference between the word vectors in en_core_web_lg and en_vectors_web_lg? The number of keys differs: 685k for en_core_web_lg vs 1.1m for en_vectors_web_lg. Both are trained on the Common Crawl corpus, so I assume the larger key count means en_vectors_web_lg has broader coverage, perhaps because it preserves more morphological variation and therefore ends up with more distinct tokens.

asked Nov 08 '17 15:11 by Michael Anslow



1 Answer

The en_vectors_web_lg package contains every vector provided by the original GloVe model. The en_core_web_lg model uses the vocabulary from the v1.x en_core_web_lg model, which, from memory, pruned out all entries that occurred fewer than 10 times in a 10 billion word dump of Reddit comments.
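To make the pruning step concrete, here is a minimal sketch of frequency-based pruning. It is not the script used to build the model: `prune_vectors`, the toy vector table, and the toy counts are all hypothetical stand-ins for the GloVe table and the Reddit frequency counts.

```python
from collections import Counter


def prune_vectors(vectors, counts, min_count=10):
    """Keep only entries whose corpus frequency is at least min_count.

    Hypothetical helper: `vectors` maps word -> vector and `counts`
    maps word -> frequency in some reference corpus.
    """
    return {
        word: vec
        for word, vec in vectors.items()
        if counts.get(word, 0) >= min_count
    }


# Toy data standing in for the full vector table and the corpus counts.
toy_vectors = {"the": [0.1, 0.2], "cat": [0.3, 0.4], "zyzzyva": [0.5, 0.6]}
toy_counts = Counter({"the": 1000, "cat": 42, "zyzzyva": 3})

pruned = prune_vectors(toy_vectors, toy_counts)
print(sorted(pruned))  # "zyzzyva" occurs fewer than 10 times, so it is dropped
```

Words that never appear in the reference corpus get a count of 0 and are dropped too, which is one way a pruned table ends up with far fewer keys than the full one.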

In theory, most of the vectors that were removed should be things that the spaCy tokenizer never produces. However, earlier experiments with the full GloVe vectors did score slightly higher than the current NER model, so it's possible we're actually missing out on something by losing the extra vectors. I'll do more experiments on this, and will likely switch the lg model to include the unpruned vector table, especially now that we have the md model, which strikes a better compromise than the current lg package.
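One rough way to check whether the removed vectors matter is to count which tokens from real tokenizer output have a vector in the full table but not in the pruned one. The sketch below uses plain Python sets and toy data; `lost_coverage` and all of the example keys and tokens are assumptions for illustration, not part of spaCy.

```python
def lost_coverage(tokens, full_keys, pruned_keys):
    """Return tokens covered by the full table but missing from the pruned one."""
    full_keys, pruned_keys = set(full_keys), set(pruned_keys)
    return sorted({t for t in tokens if t in full_keys and t not in pruned_keys})


# Toy key sets: "don't" has a vector in the full table, but a tokenizer that
# splits contractions emits "do" and "n't" instead, so losing it costs nothing.
full = {"the", "cat", "don't", "zyzzyva"}
pruned = {"the", "cat"}

# Toy tokenizer output for some text.
tokens = ["the", "cat", "do", "n't", "zyzzyva"]

lost = lost_coverage(tokens, full, pruned)
print(lost)  # only the rare word the tokenizer actually produced
```

With real models, the token list would come from running each pipeline over a sample of text and checking `token.has_vector`; an empty result would support the theory that pruning only removed entries the tokenizer never emits.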

answered Oct 15 '22 16:10 by syllogism_
