What is the difference between the word vectors given in en_core_web_lg and en_vectors_web_lg? The number of keys differs: 1.1m for en_vectors_web_lg vs 685k for en_core_web_lg. Both are trained on the Common Crawl corpus, so I assume en_vectors_web_lg has broader coverage by retaining more morphological variation, resulting in more distinct tokens.
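For reference, a minimal sketch (assuming both packages are installed) that confirms the key counts directly via each pipeline's vectors table:

```python
import spacy

# Load both packages; each ships its vectors in nlp.vocab.vectors.
nlp_core = spacy.load("en_core_web_lg")
nlp_vectors = spacy.load("en_vectors_web_lg")

for name, nlp in [("en_core_web_lg", nlp_core),
                  ("en_vectors_web_lg", nlp_vectors)]:
    vectors = nlp.vocab.vectors
    # n_keys is the number of distinct keys pointing into the table;
    # shape is (number of vector rows, vector width).
    print(name, "keys:", vectors.n_keys, "shape:", vectors.shape)
```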
The en_vectors_web_lg package includes exactly the vectors provided by the original GloVe model. The en_core_web_lg model uses the vocabulary from the v1.x en_core_web_lg model, which, from memory, pruned out all entries that occurred fewer than 10 times in a 10-billion-word dump of Reddit comments.
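As an aside, spaCy v2 exposes a related pruning mechanism, Vocab.prune_vectors, which keeps the first N rows of the table and remaps every removed entry to its nearest surviving vector. The offline frequency-threshold pruning described above is a different procedure, so treat this as an analogy rather than a reproduction; the row count below is arbitrary:

```python
import spacy

nlp = spacy.load("en_vectors_web_lg")

# Keep only the first 20,000 rows; every pruned entry is remapped to
# its closest surviving vector, so lookups for pruned words return a
# near neighbour instead of a zero vector. This computes similarities
# over the whole table, so it can take a while.
remap = nlp.vocab.prune_vectors(20000)

# remap maps each removed word to (kept_word, similarity_score).
for removed, (kept, sim) in list(remap.items())[:5]:
    print(removed, "->", kept, f"(similarity {sim:.2f})")
```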
In theory, most of the vectors that were removed should be things the spaCy tokenizer never produces. However, earlier experiments with the full GloVe vectors did score slightly higher than the current NER model, so it's possible we're actually missing out on something by losing the extra vectors. I'll do more experiments on this, and will likely switch the lg model to include the unpruned vector table, especially now that we have the md model, which strikes a better compromise than the current lg package.
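If you want to check whether the pruning matters for your own text, one sketch (the sample tokens here are arbitrary, not from the original discussion) is to run the same string through both pipelines and compare the has_vector flag per token; the tokenizers are identical, so the tokens align:

```python
import spacy

nlp_core = spacy.load("en_core_web_lg")
nlp_vectors = spacy.load("en_vectors_web_lg")

# Include a rare-ish token that frequency pruning might have dropped
# from en_core_web_lg's table.
text = "The quokka photobombed the astrophysicist's seminar"

for tok_core, tok_vec in zip(nlp_core(text), nlp_vectors(text)):
    print(f"{tok_core.text:15} core:{tok_core.has_vector}  "
          f"vectors:{tok_vec.has_vector}")
```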