Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Visualize Gensim Word2vec Embeddings in Tensorboard Projector

I've only seen a few questions that ask this, and none of them have an answer yet, so I thought I might as well try. I've been using gensim's word2vec model to create some vectors. I exported them into text, and tried importing it on tensorflow's live model of the embedding projector. One problem. It didn't work. It told me that the tensors were improperly formatted. So, being a beginner, I thought I would ask some people with more experience about possible solutions.
Equivalent to my code:

import gensim
corpus = [["words","in","sentence","one"],["words","in","sentence","two"]]
model = gensim.models.Word2Vec(iter = 5,size = 64)
model.build_vocab(corpus)
# save memory
vectors = model.wv
del model
vectors.save_word2vec_format("vect.txt",binary = False)

That creates the model, saves the vectors, and then prints the results out nice and pretty in a tab delimited file with values for all of the dimensions. I understand how to do what I'm doing, I just can't figure out what's wrong with the way I put it in tensorflow, as the documentation regarding that is pretty scarce as far as I can tell.
One idea that has been presented to me is implementing the appropriate tensorflow code, but I don’t know how to code that, just import files in the live demo.

Edit: I have a new problem now. The object I have my vectors in is non-iterable because gensim apparently decided to make its own data structures that are non-compatible with what I'm trying to do.
Ok. Done with that too! Thanks for your help!

like image 455
I. Blum Avatar asked May 23 '18 15:05

I. Blum


People also ask

How do you visualize embeds in TensorBoard?

In order to visualize the embeddings, we select PROJECTOR from the dropdown menu on the top right of the TensorBoard dashboard: Image by author.

Is Gensim Word2Vec CBOW or skip gram?

The word2vec algorithms include skip-gram and CBOW models, using either hierarchical softmax or negative sampling: Tomas Mikolov et al: Efficient Estimation of Word Representations in Vector Space, Tomas Mikolov et al: Distributed Representations of Words and Phrases and their Compositionality.

How do you visualize an embed?

To visualize the word embedding, we are going to use common dimensionality reduction techniques such as PCA and t-SNE. To map the words into their vector representations in embedding space, the pre-trained word embedding GloVe will be implemented.

Is Gensim used for word embedding?

Gensim Python Library Most notably for this tutorial, it supports an implementation of the Word2Vec word embedding for learning new word vectors from text. It also provides tools for loading pre-trained word embeddings in a few formats and for making use and querying a loaded embedding.


2 Answers

What you are describing is possible. What you have to keep in mind is that Tensorboard reads from saved tensorflow binaries which represent your variables on disk.

More information on saving and restoring tensorflow graph and variables here

The main task is therefore to get the embeddings as saved tf variables.

Assumptions:

  • in the following code embeddings is a python dict {word:np.array (np.shape==[embedding_size])}

  • python version is 3.5+

  • used libraries are numpy as np, tensorflow as tf

  • the directory to store the tf variables is model_dir/


Step 1: Stack the embeddings to get a single np.array

embeddings_vectors = np.stack(list(embeddings.values(), axis=0))
# shape [n_words, embedding_size]

Step 2: Save the tf.Variable on disk

# Create some variables.
emb = tf.Variable(embeddings_vectors, name='word_embeddings')

# Add an op to initialize the variable.
init_op = tf.global_variables_initializer()

# Add ops to save and restore all the variables.
saver = tf.train.Saver()

# Later, launch the model, initialize the variables and save the
# variables to disk.
with tf.Session() as sess:
   sess.run(init_op)

# Save the variables to disk.
   save_path = saver.save(sess, "model_dir/model.ckpt")
   print("Model saved in path: %s" % save_path)

model_dir should contain files checkpoint, model.ckpt-1.data-00000-of-00001, model.ckpt-1.index, model.ckpt-1.meta


Step 3: Generate a metadata.tsv

To have a beautiful labeled cloud of embeddings, you can provide tensorboard with metadata as Tab-Separated Values (tsv) (cf. here).

words = '\n'.join(list(embeddings.keys()))

with open(os.path.join('model_dir', 'metadata.tsv'), 'w') as f:
   f.write(words)

# .tsv file written in model_dir/metadata.tsv

Step 4: Visualize

Run $ tensorboard --logdir model_dir -> Projector.

To load metadata, the magic happens here:

load_meta


As a reminder, some word2vec embedding projections are also available on http://projector.tensorflow.org/

like image 109
syltruong Avatar answered Oct 19 '22 09:10

syltruong


Gensim actually has the official way to do this.

Documentation about it

like image 41
Marco Oliveira Avatar answered Oct 19 '22 11:10

Marco Oliveira