Efficiently Finding Closest Word In TensorFlow Embedding

Question

Recently, I've been trying to find the closest word to an embedding. The two most notable ways of doing this is by cosine distance or euclidean distance.

I'm trying to find how to efficiently compute the cosine distance for a tensor of shape [batch_size x embedding_size]

One approach is to unpack the tensor and the compute the cosine distance

  #embedding is shape [vocab_size x embedding size]
  array_list = tf.unpack(batch_array)
  word_class_list = tf.unpack(embedding)
  index_list_of_closest_word = []
  for eacharray in array_list:
    list_of_distances = []
    for eachwordclass in word_class_list:
      list_of_distances.append(cosine_distance(eacharray, eachwordclass))
    index_list_of_closest_word.append(tf.argmax(tf.pack(list_of_distances)))

However, this approach is terribly inefficient. Is there perhaps a more efficient manner to do this? I know word2vec does this pretty fast and tensorflow, with the power of a gpu, should be able to do these batch calculations in parallel.

Thanks!

Olivier Moindrot · Accepted Answer

The cosine similarity formula is:

The inputs you have are:

embedding: the embedding matrix, of shape [vocab_size, embedding_size]
batch_array: a batch of embeddings, to which you want to find the closest words, of shape [batch_size, embedding_size]

embedding = tf.placeholder(tf.float32, [vocab_size, embedding_size])
batch_array = tf.placeholder(tf.float32, [batch_size, embedding_size])

To compute the cosine similarity, you can first L2 normalize both inputs:
(you may want to store the normed embedding, as you are going to reuse it a lot)

normed_embedding = tf.nn.l2_normalize(embedding, dim=1)
normed_array = tf.nn.l2_normalize(batch_array, dim=1)

Then you have to compute the dot products of all words (vocab_size in total) vs. all arrays from the batch (batch_size in total):

cosine_similarity = tf.matmul(normed_array, tf.transpose(normed_embedding, [1, 0]))

You can finally compute the argmax for each element of the batch:

closest_words = tf.argmax(cosine_similarity, 1)  # shape [batch_size], type int64

Efficiently Finding Closest Word In TensorFlow Embedding

Tags:

tensorflow

distance

embedding

LeavesBreathe

1 Answers

Olivier Moindrot

Recent Activity

Donate For Us

Efficiently Finding Closest Word In TensorFlow Embedding

Tags:

tensorflow

distance

embedding

LeavesBreathe

1 Answers

Olivier Moindrot

Related questions

Recent Activity

Donate For Us