I am reading an example of using an RNN with TensorFlow here: ptb_word_lm.py

I can't figure out what the embedding and embedding_lookup are doing here. How can they add another dimension to the tensor, going from (20, 25) to (20, 25, 200)? In this case (20, 25) is a batch size of 20 with 25 time steps. I can't understand how/why you can add the hidden_size of the cell as a dimension of the input data. Typically the input data would be a matrix of size [batch_size, num_features], and the model would map num_features ---> hidden_dims with a matrix of size [num_features, hidden_dims], yielding an output of size [batch_size, hidden_dims]. So how can hidden_dims be a dimension of the input tensor?
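To make the shapes I have in mind concrete, here is a toy NumPy sketch of that mapping (the sizes are just my mental model, not the example's code):

import numpy as np

batch_size, num_features, hidden_dims = 20, 25, 200

x = np.random.randn(batch_size, num_features)   # input: [batch_size, num_features]
W = np.random.randn(num_features, hidden_dims)  # weights: [num_features, hidden_dims]
h = x @ W                                       # output: [batch_size, hidden_dims]
print(h.shape)                                  # (20, 200)

Here is the code I'm looking at: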
input_data, targets = reader.ptb_producer(train_data, 20, 25)
cell = tf.nn.rnn_cell.BasicLSTMCell(200, forget_bias=1.0, state_is_tuple=True)
initial_state = cell.zero_state(20, tf.float32)
embedding = tf.get_variable("embedding", [10000, 200], dtype=tf.float32)
inputs = tf.nn.embedding_lookup(embedding, input_data)

input_data  # <tf.Tensor 'PTBProducer/Slice:0' shape=(20, 25) dtype=int32>
inputs      # <tf.Tensor 'embedding_lookup:0' shape=(20, 25, 200) dtype=float32>
outputs = []
state = initial_state
for time_step in range(25):
    if time_step > 0:
        tf.get_variable_scope().reuse_variables()
    cell_output, state = cell(inputs[:, time_step, :], state)
    outputs.append(cell_output)
output = tf.reshape(tf.concat(1, outputs), [-1, 200])

outputs  # list of 25: <tf.Tensor 'BasicLSTMCell/mul_2:0' shape=(20, 200) dtype=float32>
output   # <tf.Tensor 'Reshape_2:0' shape=(500, 200) dtype=float32>
softmax_w = tf.get_variable("softmax_w", [config.hidden_size, config.vocab_size], dtype=tf.float32)
softmax_b = tf.get_variable("softmax_b", [config.vocab_size], dtype=tf.float32)
logits = tf.matmul(output, softmax_w) + softmax_b
loss = tf.nn.seq2seq.sequence_loss_by_example([logits], [tf.reshape(targets, [-1])], [tf.ones([20 * 25], dtype=tf.float32)])
cost = tf.reduce_sum(loss) / batch_size
OK, I'm not going to try to explain this specific code, but I will try to answer the "what is an embedding?" part of the title.

Basically, an embedding is a mapping of the original input data into a set of real-valued dimensions, where the "position" of the original input data in those dimensions is organized to improve the task.

Imagine a text input field whose values are "king", "queen", "girl", and "boy", and that you have 2 embedding dimensions. Hopefully backprop will train the embedding to put the concept of royalty on one axis and gender on the other. So in this case, what was a 4-value categorical feature gets "boiled down" to a floating-point embedding feature with 2 dimensions.

Embeddings are implemented using a lookup table, keyed either by hashing the original value or by a dictionary ordering. For a fully trained one, you might put in "queen" and get out, say, [1.0, 1.0]; put in "boy" and get out [0.0, 0.0].

TensorFlow backpropagates the error INTO this lookup table, and hopefully what starts off as a randomly initialized dictionary will gradually become something like the mapping we see above.
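Here is a toy NumPy sketch of such a lookup table. The word-to-row mapping and the embedding values are invented for illustration; in a real model the table is learned, not hand-written:

import numpy as np

# Hypothetical dictionary ordering: word -> row in the lookup table.
word_to_id = {"king": 0, "queen": 1, "boy": 2, "girl": 3}

# A pretend "fully trained" 2-D embedding table:
# axis 0 ~ royalty, axis 1 ~ gender (values made up).
embedding = np.array([
    [1.0, 0.0],  # king:  royal, male
    [1.0, 1.0],  # queen: royal, female
    [0.0, 0.0],  # boy:   not royal, male
    [0.0, 1.0],  # girl:  not royal, female
])

ids = np.array([word_to_id[w] for w in ["queen", "boy"]])
print(embedding[ids])  # [[1. 1.] [0. 0.]] -- a lookup is just row indexing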
Hope this helps. If not, look at: http://colah.github.io/posts/2014-07-NLP-RNNs-Representations/
At its simplest:

input_data: a batch of sequences of word IDs (shape (20, 25))
inputs: a batch of sequences of word embeddings (shape (20, 25, 200))
How does input_data become inputs, you might ask? This is what learning word embeddings does. The easiest way to imagine it is:

1. Unwrap input_data to a single batch of shape (20*25,).
2. Assign a vector of size 200 to each element in that unwrapped input_data, which gives you a matrix of shape (20*25, 200).
3. Reshape that matrix to (20, 25, 200).
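Here is that unwrap/assign/reshape view as a small NumPy sketch (random values, purely for shape-checking); tf.nn.embedding_lookup(embedding, input_data) performs the same row gather in a single call:

import numpy as np

batch, steps, emb_dim, vocab = 20, 25, 200, 10000
embedding = np.random.randn(vocab, emb_dim)               # the (vocabulary_size, 200) table
input_data = np.random.randint(0, vocab, (batch, steps))  # (20, 25) word IDs

flat = input_data.reshape(-1)                    # step 1: (20*25,)
vectors = embedding[flat]                        # step 2: (20*25, 200), one row per word ID
inputs = vectors.reshape(batch, steps, emb_dim)  # step 3: (20, 25, 200)
print(inputs.shape)                              # (20, 25, 200)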
This works because embedding learning is not a time-series process; you learn word embeddings with a feed-forward network. The next important question is: how do you learn the word embeddings?
1. Initialize a big matrix of shape (vocabulary_size, 200) (i.e. embedding in the code).
2. Optimize embedding so that a given word can predict any word from its context (e.g. in "dog barked at the mailman", if "at" is the target word, then "dog", "barked", "the" and "mailman" are the context words).
3. This gives you a vector (200 long in this example) for each word, such that semantics are preserved (i.e. the vector for "dog" is close to "cat", but far away from "pen").
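To make step 2 concrete, here is a minimal skip-gram-style sketch in plain NumPy. It uses a full softmax on a toy corpus; real systems such as word2vec use negative sampling, and every size and value here is invented for illustration:

import numpy as np

corpus = "dog barked at the mailman".split()
vocab = sorted(set(corpus))
word_to_id = {w: i for i, w in enumerate(vocab)}
V, D = len(vocab), 4                       # tiny vocabulary, tiny embedding size

rng = np.random.default_rng(0)
embedding = rng.normal(0.0, 0.1, (V, D))   # the lookup table being learned
softmax_w = rng.normal(0.0, 0.1, (D, V))

# (target, context) pairs with a window of 1,
# e.g. ("at", "barked") and ("at", "the").
pairs = [(corpus[i], corpus[j])
         for i in range(len(corpus))
         for j in (i - 1, i + 1) if 0 <= j < len(corpus)]

for _ in range(200):
    for target, context in pairs:
        t, c = word_to_id[target], word_to_id[context]
        v = embedding[t]                  # look up the target's vector
        logits = v @ softmax_w            # predict the context word
        p = np.exp(logits - logits.max())
        p /= p.sum()
        p[c] -= 1.0                       # d(cross-entropy)/d(logits)
        grad_v = softmax_w @ p            # gradient w.r.t. the looked-up row
        grad_w = np.outer(v, p)
        embedding[t] -= 0.1 * grad_v      # the error flows INTO the lookup table
        softmax_w -= 0.1 * grad_w

After training, embedding plays the same role as the (10000, 200) embedding variable in the question's code: tf.nn.embedding_lookup just indexes into it.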