
Gensim: KeyError: "word not in vocabulary"

I have trained a Word2Vec model using Python's Gensim library on a tokenized list like the one below. The vocabulary size is 34; I am showing only some of the tokens here:

b = ['let',
 'know',
 'buy',
 'someth',
 'featur',
 'mashabl',
 'might',
 'earn',
 'affili',
 'commiss',
 'fifti',
 'year',
 'ago',
 'graduat',
 '21yearold',
 'dustin',
 'hoffman',
 'pull',
 'asid',
 'given',
 'one',
 'piec',
 'unsolicit',
 'advic',
 'percent',
 'buy']

Model

import gensim

model = gensim.models.Word2Vec(b, min_count=1, size=32)
print(model)
# prints: Word2Vec(vocab=34, size=32, alpha=0.025)

If I try to look up one of the words in the list, for example with model['buy'], I get this error:

KeyError: "word 'buy' not in vocabulary"

Can you suggest what I am doing wrong, and how I can check the model so that it can then be used with PCA or t-SNE to visualize similar words forming a topic? Thank you.

asked Jul 31 '17 15:07 by Krishnang K Dalal



2 Answers

The first parameter passed to gensim.models.Word2Vec is an iterable of sentences, and each sentence is itself a list of words. From the docs:

Initialize the model from an iterable of sentences. Each sentence is a list of words (unicode strings) that will be used for training.

Right now, Word2Vec treats each word in your list b as a sentence. Because iterating over a Python string yields its characters, the model learns a vector for each character rather than for each word in b. With the model as you built it, you can do:

model = gensim.models.Word2Vec(b,min_count=1,size=32)

print(model['a'])
array([  7.42487283e-03,  -5.65282721e-03,   1.28707094e-02, ... ])
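
To see why 'a' ends up in the vocabulary but 'buy' does not, it can help to mimic in plain Python how an iterable of sentences is walked. The sketch below does not call Gensim and does no training; tokens_seen is a hypothetical helper introduced only for illustration, and the list is a shortened version of b from the question.

```python
def tokens_seen(sentences):
    """Collect the distinct tokens found by iterating over each sentence.

    This only imitates the two nested loops (sentences, then tokens
    within a sentence); it is not how Gensim is implemented internally.
    """
    return {token for sentence in sentences for token in sentence}


b = ['let', 'know', 'buy', 'someth']  # shortened version of the question's list

# Passing b directly: every string counts as a sentence,
# so the tokens are its individual characters.
flat_vocab = tokens_seen(b)

# Passing [b]: the only sentence is b itself, so the tokens are the words.
nested_vocab = tokens_seen([b])

print('buy' in flat_vocab, 'b' in flat_vocab)      # False True
print('buy' in nested_vocab, 'b' in nested_vocab)  # True False
```

The same distinction applies to the real model: with the flat list, only single characters can be looked up.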

To get it to work for words, simply wrap b in another list so that it is interpreted correctly:

model = gensim.models.Word2Vec([b],min_count=1,size=32)

print(model['buy'])
array([-0.01331611,  0.00496594, -0.00165093, -0.01444992,  0.01393849, ... ])
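
If the same code sometimes receives one tokenized document and sometimes a list of them, a small guard can do the wrapping before the model is built. This is only a sketch: as_sentences is a hypothetical helper, not part of Gensim, and it assumes tokens are plain str objects.

```python
def as_sentences(corpus):
    """Return corpus as a list of token lists.

    A flat list of strings (one tokenized document) is wrapped so that
    it becomes a single sentence. Mixing bare strings with token lists
    is ambiguous, so that case raises TypeError.
    """
    corpus = list(corpus)
    if all(isinstance(item, str) for item in corpus):
        return [corpus]
    if any(isinstance(item, str) for item in corpus):
        raise TypeError("corpus mixes bare strings and token lists")
    return [list(sentence) for sentence in corpus]


b = ['let', 'know', 'buy']  # shortened version of the question's list
sentences = as_sentences(b)
print(sentences)  # [['let', 'know', 'buy']]

# The result can then be passed to the call shown above, e.g.:
# model = gensim.models.Word2Vec(sentences, min_count=1, size=32)
```

An input that is already a list of sentences passes through unchanged, so the guard is safe to apply in both cases.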
answered Nov 16 '22 08:11 by bunji



According to the docs, you need to pass an iterable of sentences, and whatever you pass is treated that way. Here you are passing a flat list of words, so each word is treated as a sentence and Word2Vec computes a vector for each character in the corpus.

To avoid that problem, pass the list of words inside another list:

word2vec_model = gensim.models.Word2Vec([b],min_count=1,size=32)
answered Nov 16 '22 08:11 by Ravi
