I have had the gensim Word2Vec implementation compute some word embeddings for me. Everything went quite fantastically as far as I can tell; now I am clustering the word vectors created, hoping to get some semantic groupings.
As a next step, I would like to look at the words (rather than the vectors) contained in each cluster. That is, if I have the embedding vector [x, y, z], I would like to find out which actual word this vector represents. I can get the words/vocab items by calling model.vocab and the word vectors through model.syn0, but I could not find a place where the two are explicitly matched.
This was more complicated than I expected and I feel I might be missing the obvious way of doing it. Any help is appreciated!
Match words to embedding vectors created by Word2Vec() -- how do I do it?
After creating the model (code below*), I would now like to match the indexes assigned to each word (during the build_vocab() phase) to the vector matrix output as model.syn0. Thus:
for i in range(newmod.syn0.shape[0]):  # iterate over all words in the model
    print(i)
    # get the word out of the internal dictionary by its index
    word = [k for k in newmod.vocab if newmod.vocab[k].index == i][0]
    wordvector = newmod.syn0[i]  # get the vector with the corresponding index
    # testing: compare with the result of looking the word up in the model -- prints True
    print((wordvector == newmod[word]).all())
Is there a better way of doing this, e.g. by feeding the vector into the model to match the word?
Does this even get me correct results?
*My code to create the word vectors:
model = Word2Vec(size=1000, min_count=5, workers=4, sg=1)
model.build_vocab(sentencefeeder(folderlist))  # sentencefeeder puts out sentences as lists of strings
model.train(sentencefeeder(folderlist))  # train on the corpus -- without this step, syn0 only holds the random initialisation
model.save("newmodel")
I found this question which is similar but has not really been answered.
Word embeddings are trained as a set of fixed-length, dense, continuous-valued vectors over a large corpus of text. Each word is represented by a point in the embedding space, and these points are learned and moved around based on the words that surround the target word.
There are two main training algorithms for word2vec: continuous bag of words (CBOW) and skip-gram. The major difference between them is that CBOW uses the context to predict a target word, while skip-gram uses a word to predict a target context.
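For illustration, here is a minimal sketch of how the two algorithms are selected in gensim via the sg parameter (the toy sentences corpus and min_count=1 are just for the example):

from gensim.models import Word2Vec

sentences = [["the", "quick", "brown", "fox"], ["jumps", "over", "the", "lazy", "dog"]]

cbow_model = Word2Vec(sentences, sg=0, min_count=1)      # sg=0 selects continuous bag of words (the default)
skipgram_model = Word2Vec(sentences, sg=1, min_count=1)  # sg=1 selects skip-gram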
To assess which word2vec model is best, take a fixed evaluation set of related word pairs (say 200 of them), calculate the distance between the vectors of each pair, and sum the distances; the model with the smallest total distance is your best model.
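A minimal sketch of that evaluation, assuming a hand-picked list pairs of 200 related word pairs and a dict models mapping model names to trained models (both names are placeholders), using the same old-style model.similarity call as the question:

def total_distance(model, pairs):
    # sum of cosine distances (1 - similarity) over the evaluation pairs;
    # a smaller total means the model places related words closer together
    return sum(1 - model.similarity(w1, w2) for w1, w2 in pairs)

scores = {name: total_distance(m, pairs) for name, m in models.items()}
best = min(scores, key=scores.get)  # the model with the smallest total distance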
I have been searching for a long time to find the mapping between the syn0 matrix and the vocabulary... here is the answer: use model.index2word, which is simply the list of words in the right order!
This is not in the official documentation (why?), but it can be found directly inside the source code: https://github.com/RaRe-Technologies/gensim/blob/3b9bb59dac0d55a1cd6ca8f984cead38b9cb0860/gensim/models/word2vec.py#L441
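A quick sketch of how to line the two up with it, using the question's old-style attributes (index2word is ordered so that index2word[i] is the word for row i of syn0):

words = newmod.index2word  # list of words, ordered by their row index in syn0
for i in range(5):  # show the first five word/vector pairs
    print(words[i], newmod.syn0[i])

word_to_vector = dict(zip(newmod.index2word, newmod.syn0))  # explicit word -> vector mapping

For the reverse lookup (vector to word) asked about above, newer gensim versions also offer similar_by_vector, which returns the words whose embeddings are closest to a given vector -- worth checking whether your version has it.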