Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Matching words and vectors in gensim Word2Vec model

Tags:

I have had the gensim Word2Vec implementation compute some word embeddings for me. Everything went quite fantastically as far as I can tell; now I am clustering the word vectors created, hoping to get some semantic groupings.

As a next step, I would like to look at the words (rather than the vectors) contained in each cluster. I.e. if I have the vector of embeddings [x, y, z], I would like to find out which actual word this vector represents. I can get the words/Vocab items by calling model.vocab and the word vectors through model.syn0. But I could not find a location where these are explicitly matched.

This was more complicated than I expected and I feel I might be missing the obvious way of doing it. Any help is appreciated!

Problem:

Match words to embedding vectors created by Word2Vec () -- how do I do it?

My approach:

After creating the model (code below*), I would now like to match the indexes assigned to each word (during the build_vocab() phase) to the vector matrix outputted as model.syn0. Thus

for i in range (0, newmod.syn0.shape[0]): #iterate over all words in model
    print i
    word= [k for k in newmod.vocab if newmod.vocab[k].__dict__['index']==i] #get the word out of the internal dicationary by its index
    wordvector= newmod.syn0[i] #get the vector with the corresponding index
    print wordvector == newmod[word] #testing: compare result of looking up the word in the model -- this prints True
  • Is there a better way of doing this, e.g. by feeding the vector into the model to match the word?

  • Does this even get me correct results?

*My code to create the word vectors:

model = Word2Vec(size=1000, min_count=5, workers=4, sg=1)
        
model.build_vocab(sentencefeeder(folderlist)) #sentencefeeder puts out sentences as lists of strings

model.save("newmodel")

I found this question which is similar but has not really been answered.

like image 316
patrick Avatar asked Jul 29 '16 18:07

patrick


People also ask

How does gensim word2vec work?

Word embeddings work by using an algorithm to train a set of fixed-length dense and continuous-valued vectors based on a large corpus of text. Each word is represented by a point in the embedding space and these points are learned and moved around based on the words that surround the target word.

Is gensim word2vec CBOW or skip gram?

There are two main training algorithms for word2vec, one is the continuous bag of words(CBOW), another is called skip-gram. The major difference between these two methods is that CBOW is using context to predict a target word while skip-gram is using a word to predict a target context.

How do you evaluate a word2vec model?

To assess which word2vec model is best, simply calculate the distance for each pair, do it 200 times, sum up the total distance, and the smallest total distance will be your best model.


Video Answer


1 Answers

I have been searching for a long time to find the mapping between the syn0 matrix and the vocabulary... here is the answer : use model.index2word which is simply the list of words in the right order !

This is not in the official documentation (why ?) but it can be found directly inside the source code : https://github.com/RaRe-Technologies/gensim/blob/3b9bb59dac0d55a1cd6ca8f984cead38b9cb0860/gensim/models/word2vec.py#L441

like image 122
Arcyno Avatar answered Sep 22 '22 08:09

Arcyno