I've recently reviewed an interesting implementation of convolutional text classification. However, all of the TensorFlow code I've reviewed uses random (not pre-trained) embedding vectors, like the following:
with tf.device('/cpu:0'), tf.name_scope("embedding"):
    W = tf.Variable(
        tf.random_uniform([vocab_size, embedding_size], -1.0, 1.0),
        name="W")
    self.embedded_chars = tf.nn.embedding_lookup(W, self.input_x)
    self.embedded_chars_expanded = tf.expand_dims(self.embedded_chars, -1)
Does anybody know how to use the results of Word2vec or a GloVe pre-trained word embedding instead of a random one?
In practice, Word2Vec uses negative sampling, which replaces the softmax over the whole vocabulary with a set of sigmoid (binary) classification problems. This tends to produce cone-shaped clusters of words in the vector space, while GloVe's word vectors are spread more evenly, and it makes Word2Vec faster to compute than GloVe.
Word2Vec takes raw text as training data for a neural network, and the resulting embedding captures whether words appear in similar contexts. GloVe focuses on word co-occurrences over the whole corpus, and its embeddings relate to the probabilities that two words appear together.
This can mean that for semantic NLP tasks, when the training set at hand is sufficiently large (as was the case in the sentiment analysis experiments), it is better to use pre-trained word embeddings.
Google's Word2Vec pretrained word embedding

Word2Vec is one of the most popular pretrained word embeddings; it was developed by Google and is trained on the Google News dataset (about 100 billion words).
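If you want to use those vectors, one common route is the gensim library. This is only a sketch, assuming gensim is installed and the GoogleNews-vectors-negative300.bin file has been downloaded (that filename is the usual default, not something from the code above):

from gensim.models import KeyedVectors

# Load the binary Google News vectors; this needs a few GB of RAM.
word2vec = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

print(word2vec["movie"].shape)         # (300,) -- each word maps to a 300-d vector
print(word2vec.most_similar("movie"))  # quick sanity check that the vectors loaded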
There are a few ways that you can use a pre-trained embedding in TensorFlow. Let's say that you have the embedding in a NumPy array called embedding, with vocab_size rows and embedding_dim columns, and you want to create a tensor W that can be used in a call to tf.nn.embedding_lookup().
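If you don't have that array yet, here is a minimal sketch of how it could be built from a GloVe text file. The filename, dimensionality, and vocabulary list are placeholders for whatever your model uses; words not found in GloVe keep a random vector:

import numpy as np

embedding_dim = 100
vocab = ["<pad>", "the", "movie", "was", "great"]  # your model's vocabulary
vocab_size = len(vocab)
word_to_id = {w: i for i, w in enumerate(vocab)}

# Start from random vectors, then overwrite the rows for words found in GloVe.
embedding = np.random.uniform(-1.0, 1.0,
                              (vocab_size, embedding_dim)).astype(np.float32)
with open("glove.6B.100d.txt", encoding="utf-8") as f:
    for line in f:
        parts = line.rstrip().split(" ")
        word, vector = parts[0], parts[1:]
        if word in word_to_id:
            embedding[word_to_id[word]] = np.asarray(vector, dtype=np.float32)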
Option 1: Simply create W as a tf.constant() that takes embedding as its value:
W = tf.constant(embedding, name="W")
This is the easiest approach, but it is not memory efficient because the value of a tf.constant() is stored multiple times in memory. Since embedding can be very large, you should only use this approach for toy examples.
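For example, here is a sketch of dropping this into the embedding layer from the question, reusing the names from that snippet and assuming embedding is the NumPy array described above:

with tf.device('/cpu:0'), tf.name_scope("embedding"):
    # Pre-trained vectors instead of tf.random_uniform; W is now fixed.
    W = tf.constant(embedding, dtype=tf.float32, name="W")
    self.embedded_chars = tf.nn.embedding_lookup(W, self.input_x)
    self.embedded_chars_expanded = tf.expand_dims(self.embedded_chars, -1)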
Option 2: Create W as a tf.Variable and initialize it from the NumPy array via a tf.placeholder():
W = tf.Variable(tf.constant(0.0, shape=[vocab_size, embedding_dim]),
                trainable=False, name="W")
embedding_placeholder = tf.placeholder(tf.float32, [vocab_size, embedding_dim])
embedding_init = W.assign(embedding_placeholder)
# ...
sess = tf.Session()
sess.run(embedding_init, feed_dict={embedding_placeholder: embedding})
This avoids storing a copy of embedding in the graph, but it does require enough memory to keep two copies of the matrix at once (one for the NumPy array and one for the tf.Variable). Note that I've assumed you want to hold the embedding matrix constant during training, so W is created with trainable=False.
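If you would rather fine-tune the pre-trained vectors during training, the same pattern works with trainable=True. This is just a sketch; whether fine-tuning helps depends on your task and the size of your training set:

# Same as above, but gradients will now update the embedding matrix.
W = tf.Variable(tf.constant(0.0, shape=[vocab_size, embedding_dim]),
                trainable=True, name="W")
embedding_placeholder = tf.placeholder(tf.float32, [vocab_size, embedding_dim])
embedding_init = W.assign(embedding_placeholder)
# Run embedding_init once before training, exactly as in the snippet above.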
Option 3: If the embedding was trained as part of another TensorFlow model, you can use a tf.train.Saver to load the value from the other model's checkpoint file. This means that the embedding matrix can bypass Python altogether. Create W as in option 2, then do the following:
W = tf.Variable(...)
embedding_saver = tf.train.Saver({"name_of_variable_in_other_model": W})
# ...
sess = tf.Session()
embedding_saver.restore(sess, "checkpoint_filename.ckpt")
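If you are not sure what the variable is called in the other model's checkpoint, you can list its contents first. A sketch, assuming TensorFlow 1.x, with the checkpoint path being the placeholder from above:

# Print every variable name and shape stored in the checkpoint so you can
# find the embedding matrix and use its exact name in the Saver dict above.
for name, shape in tf.train.list_variables("checkpoint_filename.ckpt"):
    print(name, shape)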