
Efficiently Finding Closest Word In TensorFlow Embedding

Recently, I've been trying to find the closest word to an embedding. The two most common ways of doing this are cosine distance and Euclidean distance.

I'm trying to figure out how to efficiently compute the cosine distance for a tensor of shape [batch_size x embedding_size].

One approach is to unpack both tensors and then compute the cosine distance pairwise:

  # embedding is shape [vocab_size x embedding_size]
  # batch_array is shape [batch_size x embedding_size]
  # cosine_distance is a user-defined pairwise distance function
  array_list = tf.unpack(batch_array)
  word_class_list = tf.unpack(embedding)
  index_list_of_closest_word = []
  for eacharray in array_list:
    list_of_distances = []
    for eachwordclass in word_class_list:
      list_of_distances.append(cosine_distance(eacharray, eachwordclass))
    # closest word = smallest distance, hence argmin over the vocab axis
    index_list_of_closest_word.append(tf.argmin(tf.pack(list_of_distances), 0))

However, this approach is terribly inefficient: it creates batch_size × vocab_size separate ops in the graph. Is there a more efficient way to do this? I know word2vec does this pretty fast, and TensorFlow, with the power of a GPU, should be able to do these batch calculations in parallel.

Thanks!

asked Jun 01 '16 by LeavesBreathe

1 Answer

The cosine similarity between two vectors a and b is:

cosine_similarity(a, b) = (a · b) / (‖a‖ ‖b‖)
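
As a quick sanity check, here is a minimal NumPy sketch of that formula (the two vectors are made-up toy data):

import numpy as np

# Two made-up vectors, just to illustrate the formula
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

# (a · b) / (‖a‖ ‖b‖)
cos_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(cos_sim)  # ~0.9746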


The inputs you have are:

  • embedding: the embedding matrix, of shape [vocab_size, embedding_size]
  • batch_array: a batch of embeddings, to which you want to find the closest words, of shape [batch_size, embedding_size]

embedding = tf.placeholder(tf.float32, [vocab_size, embedding_size])
batch_array = tf.placeholder(tf.float32, [batch_size, embedding_size])

To compute the cosine similarity, you can first L2-normalize both inputs (you may want to store the normalized embedding, as you are going to reuse it a lot):

# dim=1 normalizes each row (i.e. each embedding vector) to unit L2 norm
normed_embedding = tf.nn.l2_normalize(embedding, dim=1)
normed_array = tf.nn.l2_normalize(batch_array, dim=1)

Since both inputs now have unit norm, the dot product of two rows is exactly their cosine similarity. You can therefore compute the dot products of all words (vocab_size in total) vs. all vectors in the batch (batch_size in total) with a single matrix multiplication:

cosine_similarity = tf.matmul(normed_array, tf.transpose(normed_embedding, [1, 0]))
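
(Equivalently, tf.matmul(normed_array, normed_embedding, transpose_b=True) lets matmul handle the transpose itself.)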

You can finally compute the argmax for each element of the batch:

closest_words = tf.argmax(cosine_similarity, 1)  # shape [batch_size], type int64
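
Putting it all together, here is a minimal self-contained sketch with made-up sizes and random data (written against the TF 1.x graph API used above; under TF 2.x you would use tf.compat.v1 or plain tensors instead of placeholders):

import numpy as np
import tensorflow as tf

vocab_size, embedding_size, batch_size = 1000, 64, 8  # made-up sizes

embedding = tf.placeholder(tf.float32, [vocab_size, embedding_size])
batch_array = tf.placeholder(tf.float32, [batch_size, embedding_size])

# Normalize rows once; after this, dot product == cosine similarity.
normed_embedding = tf.nn.l2_normalize(embedding, dim=1)
normed_array = tf.nn.l2_normalize(batch_array, dim=1)

# One matmul computes all batch_size x vocab_size similarities at once.
cosine_similarity = tf.matmul(normed_array, normed_embedding, transpose_b=True)
closest_words = tf.argmax(cosine_similarity, 1)  # shape [batch_size]

with tf.Session() as sess:
    emb = np.random.randn(vocab_size, embedding_size).astype(np.float32)
    batch = np.random.randn(batch_size, embedding_size).astype(np.float32)
    print(sess.run(closest_words, {embedding: emb, batch_array: batch}))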
answered Oct 11 '22 by Olivier Moindrot