I have had the gensim Word2Vec implementation compute some word embeddings for me. Everything went quite fantastically as far as I can tell; now I am clustering the word vectors created, hoping to get some semantic groupings.
As a next step, I would like to look at the words (rather than the vectors) contained in each cluster. That is, if I have the embedding vector [x, y, z], I would like to find out which actual word this vector represents. I can get the words/vocab items by calling model.vocab and the word vectors through model.syn0, but I could not find a place where the two are explicitly matched.
This was more complicated than I expected and I feel I might be missing the obvious way of doing it. Any help is appreciated!
Match words to embedding vectors created by Word2Vec() -- how do I do it?
After creating the model (code below*), I would now like to match the indexes assigned to each word (during the build_vocab() phase) to the vector matrix output as model.syn0. Thus:
for i in range(newmod.syn0.shape[0]):  # iterate over all words in the model
    print(i)
    # get the word out of the internal dictionary by its index
    word = [k for k in newmod.vocab if newmod.vocab[k].index == i][0]
    wordvector = newmod.syn0[i]  # get the vector with the corresponding index
    # testing: compare with the result of looking the word up in the model -- prints True
    print((wordvector == newmod[word]).all())
Is there a better way of doing this, e.g. by feeding the vector into the model to match the word?
Does this even get me correct results?
*My code to create the word vectors:
model = Word2Vec(size=1000, min_count=5, workers=4, sg=1)
model.build_vocab(sentencefeeder(folderlist))  # sentencefeeder puts out sentences as lists of strings
model.train(sentencefeeder(folderlist))  # train on the corpus -- without this step, syn0 only holds the random initialisation
model.save("newmodel")
I found this question which is similar but has not really been answered.
Word embeddings are trained as a set of fixed-length, dense, continuous-valued vectors over a large corpus of text. Each word is represented by a point in the embedding space, and these points are learned and moved around based on the words that surround the target word.
There are two main training algorithms for word2vec: continuous bag of words (CBOW) and skip-gram. The major difference between them is that CBOW uses the context to predict a target word, while skip-gram uses a word to predict a target context.
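For illustration, here is a minimal sketch of how the two algorithms are selected in gensim via the sg parameter (the toy sentences corpus and min_count=1 are just for the example):

from gensim.models import Word2Vec

sentences = [["the", "quick", "brown", "fox"], ["jumps", "over", "the", "lazy", "dog"]]

cbow_model = Word2Vec(sentences, sg=0, min_count=1)      # sg=0 selects continuous bag of words (the default)
skipgram_model = Word2Vec(sentences, sg=1, min_count=1)  # sg=1 selects skip-gram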
To assess which word2vec model is best, take a fixed evaluation set of related word pairs (say 200 of them), calculate the distance between the vectors of each pair, and sum the distances; the model with the smallest total distance is your best model.
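A minimal sketch of that evaluation, assuming a hand-picked list pairs of 200 related word pairs and a dict models mapping model names to trained models (both names are placeholders), using the same old-style model.similarity call as the question:

def total_distance(model, pairs):
    # sum of cosine distances (1 - similarity) over the evaluation pairs;
    # a smaller total means the model places related words closer together
    return sum(1 - model.similarity(w1, w2) for w1, w2 in pairs)

scores = {name: total_distance(m, pairs) for name, m in models.items()}
best = min(scores, key=scores.get)  # the model with the smallest total distance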
I have been searching for a long time to find the mapping between the syn0 matrix and the vocabulary... here is the answer: use model.index2word, which is simply the list of words in the right order!
This is not in the official documentation (why?), but it can be found directly inside the source code: https://github.com/RaRe-Technologies/gensim/blob/3b9bb59dac0d55a1cd6ca8f984cead38b9cb0860/gensim/models/word2vec.py#L441
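A quick sketch of how to line the two up with it, using the question's old-style attributes (index2word is ordered so that index2word[i] is the word for row i of syn0):

words = newmod.index2word  # list of words, ordered by their row index in syn0
for i in range(5):  # show the first five word/vector pairs
    print(words[i], newmod.syn0[i])

word_to_vector = dict(zip(newmod.index2word, newmod.syn0))  # explicit word -> vector mapping

For the reverse lookup (vector to word) asked about above, newer gensim versions also offer similar_by_vector, which returns the words whose embeddings are closest to a given vector -- worth checking whether your version has it.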