
Differences between en_vectors_web_lg and Glove vectors (spaCy)

https://spacy.io/models/en#en_vectors_web_lg states that the model contains 1.1m keys, but https://nlp.stanford.edu/projects/glove/ states that the GloVe vectors have a vocabulary of 2.2M words.

Which words are missing from the spaCy model?

Thank you very much.

hi bye asked Feb 14 '18 04:02


1 Answer

You can examine the vocabularies of the spaCy and GloVe models yourself by comparing the contents of spaCy's nlp.vocab object with the words in the GloVe file. First load the data into two lists:

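import io
import spacy
nlp = spacy.load('en_vectors_web_lg')
# All strings in the vocabulary's string store
spacy_words = list(nlp.vocab.strings)
glove_filename = 'glove.840B.300d.txt'
# Each line is a token followed by its vector; keep the token only.
# io.open decodes to unicode on both Python 2 and 3.
with io.open(glove_filename, encoding='utf-8') as f:
    glove_words = [line.split(' ', 1)[0] for line in f]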

Then examine the set difference to get "missing" words:

>>> list(set(glove_words) - set(spacy_words))[:10]
[u'Inculcation', u'Dholes', u'6-night', u'AscensionMidkemia',
 u'.90.99', u'USAMol', u'USAMon', u'Connerty', u'RealLife',
 u'NaughtyAllie']

>>> list(set(spacy_words) - set(glove_words))[:10]
[u'ftdna', u'verplank', u'NICARIO', u'Plastic-Treated', u'ZAI-TECH',
 u'Lower-Sulfur', u'desmonds', u'KUDNER', u'berlinghoff', u'50-ACRE']

The number of missing words is larger than 2.2M - 1.1M ≈ 1.1M:

>>> len(set(glove_words) - set(spacy_words))
1528158
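
The count can exceed the naive difference of the two vocabulary sizes because the spaCy list also contains words that GloVe lacks, so the overlap is smaller than the spaCy vocabulary. A minimal sketch of that arithmetic, using toy word lists as stand-ins for the real ones (the helper name compare_vocabularies and the toy data are hypothetical, not part of spaCy or GloVe):

```python
def compare_vocabularies(a, b):
    """Return the sizes of both set differences and of the intersection."""
    a, b = set(a), set(b)
    return {
        "only_in_a": len(a - b),
        "only_in_b": len(b - a),
        "shared": len(a & b),
    }


# Toy stand-ins for glove_words and spacy_words (hypothetical data).
glove_toy = ["the", "cat", "Dholes", "RealLife", "6-night", "USAMol"]
spacy_toy = ["the", "cat", "ftdna", "verplank"]

stats = compare_vocabularies(glove_toy, spacy_toy)
print(stats)

# The naive estimate len(a) - len(b) undercounts the missing words
# whenever b contains words that are not in a.
naive = len(set(glove_toy)) - len(set(spacy_toy))
print(naive, stats["only_in_a"])
```

Calling compare_vocabularies(glove_words, spacy_words) on the real lists should give the same kind of breakdown, with only_in_a corresponding to the count shown above.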

Note that nlp.vocab.strings and nlp.vocab.vectors do not hold the same set of entries. You can load the words that actually have a vector from the vectors object with:

vector_words = []
for key, vector in nlp.vocab.vectors.items():
    try:
        vector_words.append(nlp.vocab.strings[key])
    except KeyError:
        pass

(Regarding the try/except: it is unclear to me why some keys are missing from vocab.strings.)

With this list you get:

>>> list(set(glove_words) - set(vector_words))[:10]
[u'Inculcation', u'Dholes', u'6-night', u'AscensionMidkemia', u'.90.99',  
 u'USAMol', u'USAMon', u'Connerty', u'RealLife', u'NaughtyAllie']

Update: The discrepancy between the vocabularies has been raised as an issue at https://github.com/explosion/spaCy/issues/1985.

Finn Årup Nielsen answered Sep 20 '22 13:09