What does embedding do in TensorFlow?

I am reading an example of using an RNN with TensorFlow here: ptb_word_lm.py

I can't figure out what embedding and embedding_lookup are doing here. How can the lookup add another dimension to the tensor, going from (20, 25) to (20, 25, 200)? In this case (20, 25) is a batch size of 20 with 25 time steps. I can't understand how or why you can add the hidden_size of the cell as a dimension of the input data. Typically the input data would be a matrix of size [batch_size, num_features], and the model would map num_features ---> hidden_dims with a matrix of size [num_features, hidden_dims], yielding an output of size [batch_size, hidden_dims]. So how can hidden_dims be a dimension of the input tensor?

input_data, targets = reader.ptb_producer(train_data, 20, 25)
cell = tf.nn.rnn_cell.BasicLSTMCell(200, forget_bias=1.0, state_is_tuple=True)
initial_state = cell.zero_state(20, tf.float32)
embedding = tf.get_variable("embedding", [10000, 200], dtype=tf.float32)
inputs = tf.nn.embedding_lookup(embedding, input_data)

input_data # <tf.Tensor 'PTBProducer/Slice:0' shape=(20, 25) dtype=int32>
inputs # <tf.Tensor 'embedding_lookup:0' shape=(20, 25, 200) dtype=float32>

outputs = []
state = initial_state
for time_step in range(25):
    if time_step > 0: 
        tf.get_variable_scope().reuse_variables()

    cell_output, state = cell(inputs[:, time_step, :], state)
    outputs.append(cell_output)

output = tf.reshape(tf.concat(1, outputs), [-1, 200])

outputs # list of 25 (one per time step): <tf.Tensor 'BasicLSTMCell/mul_2:0' shape=(20, 200) dtype=float32>
output # <tf.Tensor 'Reshape_2:0' shape=(500, 200) dtype=float32>

softmax_w = tf.get_variable("softmax_w", [config.hidden_size, config.vocab_size], dtype=tf.float32)
softmax_b = tf.get_variable("softmax_b", [config.vocab_size], dtype=tf.float32)  # bias has one entry per vocabulary word
logits = tf.matmul(output, softmax_w) + softmax_b

loss = tf.nn.seq2seq.sequence_loss_by_example([logits], [tf.reshape(targets, [-1])],[tf.ones([20*25], dtype=tf.float32)])
cost = tf.reduce_sum(loss) / 20  # 20 is the batch size
anthonybell asked Oct 21 '16 19:10


2 Answers

OK, I'm not going to try to explain this specific code, but I will try to answer the "what is an embedding?" part of the title.

Basically, it's a mapping of the original input data into some set of real-valued dimensions, and the "position" of the original input data in those dimensions is learned so that it helps with the task.

In TensorFlow, imagine some text input field that takes the values "king", "queen", "girl", "boy", and suppose you have 2 embedding dimensions. Hopefully backprop will train the embedding to put the concept of royalty on one axis and gender on the other. So in this case, what was a categorical feature with 4 values gets "boiled" down to a floating-point embedding feature with 2 dimensions.

Embeddings are implemented using a lookup table, keyed either by a hash of the original value or by its position in a dictionary ordering. For a fully trained one, you might put in "Queen" and get out, say, [1.0, 1.0]; put in "Boy" and you get out [0.0, 0.0].
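
A minimal NumPy sketch of such a lookup table, not TensorFlow's actual implementation. The vocabulary ordering and the rows for "king" and "girl" are assumptions made up for illustration; only the "queen" and "boy" values come from the text above. It also shows that picking a row is the same as multiplying a one-hot vector by the table, which is the usual [num_features, hidden_dims] mapping in disguise:

```python
import numpy as np

# Hypothetical vocabulary; the index of each word is its row in the table.
vocab = ["king", "queen", "girl", "boy"]
word_to_id = {word: i for i, word in enumerate(vocab)}

# Hand-written stand-in for a trained table: one row per word,
# columns are (royalty, gender).
embedding = np.array([
    [1.0, 0.0],  # king  (assumed)
    [1.0, 1.0],  # queen
    [0.0, 1.0],  # girl  (assumed)
    [0.0, 0.0],  # boy
])

def lookup(word):
    """Return the embedding row for a word."""
    return embedding[word_to_id[word]]

# The same result via a one-hot vector times the table.
one_hot = np.zeros(len(vocab))
one_hot[word_to_id["queen"]] = 1.0
via_matmul = one_hot @ embedding
```

Here lookup("queen") and via_matmul hold the same two numbers; the lookup just skips the multiplication by all the zeros.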

TensorFlow backpropagates the error INTO this lookup table, and hopefully what starts off as a randomly initialized table will gradually come to look like the example above.
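
To make "backprop into the lookup table" concrete, here is a rough NumPy sketch of a single gradient step. TensorFlow computes the gradients automatically; the toy loss, the looked-up ids, the target values and the learning rate below are all assumptions chosen for illustration. The point is that only the rows that were looked up get changed:

```python
import numpy as np

rng = np.random.default_rng(0)
table = rng.normal(size=(4, 2))      # randomly initialised lookup table
before = table.copy()

ids = np.array([1, 3])               # rows looked up in this step (hypothetical)
targets = np.array([[1.0, 1.0],      # made-up values those rows should move toward
                    [0.0, 0.0]])

# Toy loss: 0.5 * sum((table[ids] - targets) ** 2).
# Its gradient with respect to each looked-up row is (row - target);
# rows that were not looked up get zero gradient.
grad_rows = table[ids] - targets

learning_rate = 0.1
# np.subtract.at accumulates correctly even if an id appears more than once.
np.subtract.at(table, ids, learning_rate * grad_rows)
```

After the step, rows 0 and 2 are untouched, while rows 1 and 3 have moved slightly toward their targets.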

Hope this helps. If not, look at: http://colah.github.io/posts/2014-07-NLP-RNNs-Representations/

W D answered Nov 14 '22 23:11


At its simplest:

input_data: a batch of sequences of word IDs (with shape (20, 25))

inputs: a batch of sequences of word embeddings (with shape (20, 25, 200))

How does input_data become inputs, you might ask? This is what looking up the learned word embeddings does. The easiest way to picture it is:

  1. Flatten input_data into a single batch of shape (20*25,).
  2. Assign a vector of size 200 to each element of that flattened input_data, which gives you a matrix of shape (20*25, 200).
  3. Reshape the matrix to shape (20, 25, 200).
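
The three steps above can be sketched in NumPy as follows. The arrays are random stand-ins for the tensors in the question (only the shapes matter), and the variable names flat_ids, flat_vectors and direct are made up for this sketch:

```python
import numpy as np

batch_size, num_steps = 20, 25
vocab_size, embed_dim = 10000, 200

rng = np.random.default_rng(0)
# Stand-ins for the tensors in the question (random values, for shapes only).
input_data = rng.integers(0, vocab_size, size=(batch_size, num_steps))
embedding = rng.normal(size=(vocab_size, embed_dim)).astype(np.float32)

# 1. Flatten the word IDs to shape (20*25,).
flat_ids = input_data.reshape(-1)
# 2. Pick one 200-long row per ID: shape (20*25, 200).
flat_vectors = embedding[flat_ids]
# 3. Restore the batch and time dimensions: shape (20, 25, 200).
inputs = flat_vectors.reshape(batch_size, num_steps, embed_dim)

# Indexing with the 2-D ID array directly gives the same result.
direct = embedding[input_data]
```

In this analogy, direct plays the role of tf.nn.embedding_lookup(embedding, input_data): the extra dimension of 200 comes from replacing every integer ID with its row of the table.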

This works because the embedding lookup is not a time-series process: each word ID is mapped to its vector independently of where it sits in the sequence. You can learn word embeddings with a feed-forward network. The next important question is how you learn the word embeddings.

  1. Initialise a huge TensorFlow variable of size (vocabulary_size, 200) (i.e. embedding in the code).
  2. Optimise the embedding so that a given word is able to predict any word from its context (e.g. in "dog barked at the mailman", if "at" is the target word, then "dog", "barked", "the" and "mailman" are the context words).
  3. This process gives you a vector (200 long in this example) for each word, such that semantics are preserved (i.e. the vector of "dog" is close to that of "cat", but far away from that of "pen").

Here's an overview of what I just explained.

[Image: embeddings]
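
As a small illustration of the target/context idea in step 2 of the list above, here is a plain-Python sketch that builds (target, context) pairs from the example sentence. The function name context_pairs and the window size of 2 are assumptions; this covers only the data-preparation side, not the training of the embedding itself:

```python
def context_pairs(tokens, window=2):
    """Return (target, context) pairs for every token within `window` positions."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

sentence = "dog barked at the mailman".split()
pairs = context_pairs(sentence, window=2)

# Context words seen when "at" is the target word.
at_contexts = [context for target, context in pairs if target == "at"]
```

With a window of 2, the contexts collected for "at" are exactly "dog", "barked", "the" and "mailman", matching the example in the list.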

thushv89 answered Nov 14 '22 22:11