I am attempting to replicate the character-level language modelling demonstrated in the excellent article http://karpathy.github.io/2015/05/21/rnn-effectiveness/ using TensorFlow.
So far my attempts have failed. My network typically collapses to outputting a single repeated character after processing 800 or so characters. I believe I have fundamentally misunderstood the way TensorFlow implements LSTMs, and perhaps RNNs in general, and I am finding the documentation difficult to follow.
Here is the essence of my code:
Graph definition
idata = tf.placeholder(tf.int32,[None,1]) #input byte, use value 256 for start and end of file
odata = tf.placeholder(tf.int32,[None,1]) #target output byte, ie, next byte in sequence..
source = tf.to_float(tf.one_hot(idata,257)) #input byte as 1-hot float
target = tf.to_float(tf.one_hot(odata,257)) #target output as 1-hot float
with tf.variable_scope("lstm01"):
cell1 = tf.nn.rnn_cell.BasicLSTMCell(257)
val1, state1 = tf.nn.dynamic_rnn(cell1, source, dtype=tf.float32)
output = val1
Loss Calculation
cross_entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(output, target))
train_step = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy)
output_am = tf.argmax(output,2)
target_am = tf.argmax(target,2)
correct_prediction = tf.equal(output_am, target_am)
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
Training
for i in range(0, source_data.size-1, batch_size):
    start = i
    stop = i + batch_size
    i_data = source_data[start:stop].reshape([-1,1])
    o_data = source_data[start+1:stop+1].reshape([-1,1])
    train_step.run(feed_dict={idata: i_data, odata: o_data})
    if i%(report_interval*batch_size) == 0:
        batch_out, fa = sess.run([output_am, accuracy], feed_dict={idata: i_data, odata: o_data, keep_prob: 1.0})
        print("step %d, training accuracy %s"%(i, str(fa)))
        print("i_data sample: %s"%str(squeeze(i_data)))
        print("o_data sample: %s"%str(squeeze(o_data)))
        print("batch sample: %s"%str(squeeze(batch_out)))
Output, using a 1MB Shakespeare file to train
step 0, training accuracy 0.0
i_data sample: [ 256. 70. 105. 114. 115. 116. 32. 67. 105. 116.]
o_data sample: [ 70. 105. 114. 115. 116. 32. 67. 105. 116. 105.]
batch sample: [254 18 151 64 51 199 83 174 151 199]
step 400, training accuracy 0.2
i_data sample: [ 32. 98. 101. 32. 100. 111. 110. 101. 58. 32.]
o_data sample: [ 98. 101. 32. 100. 111. 110. 101. 58. 32. 97.]
batch sample: [ 32 101 32 32 32 32 10 32 101 32]
step 800, training accuracy 0.0
i_data sample: [ 112. 97. 114. 116. 105. 99. 117. 108. 97. 114.]
o_data sample: [ 97. 114. 116. 105. 99. 117. 108. 97. 114. 105.]
batch sample: [101 101 101 32 101 101 32 101 101 101]
step 1200, training accuracy 0.1
i_data sample: [ 63. 10. 10. 70. 105. 114. 115. 116. 32. 67.]
o_data sample: [ 10. 10. 70. 105. 114. 115. 116. 32. 67. 105.]
batch sample: [ 32 32 32 101 32 32 32 32 32 32]
step 1600, training accuracy 0.2
i_data sample: [ 32. 116. 105. 108. 108. 32. 116. 104. 101. 32.]
o_data sample: [ 116. 105. 108. 108. 32. 116. 104. 101. 32. 97.]
batch sample: [32 32 32 32 32 32 32 32 32 32]
This is clearly incorrect.
I think I am getting confused by the difference between 'batches' and 'sequences', and by whether or not the state of the LSTM is preserved between what I call 'batches' (i.e., sub-sequences).
I'm getting the impression that I've trained it using 'batches' of sequences of length 1, and that between each batch, state data is discarded. Consequently it is simply finding the most commonly occurring symbol.
Can anyone confirm this, or otherwise correct my mistake, and give some indication of how I should go about character-by-character prediction using very long training sequences?
Many Thanks.
So the input you feed to dynamic_rnn (your one-hot encoded source) should have a shape of [batch_size, maximum_sequence_length, 257], i.e. idata itself should be [batch_size, maximum_sequence_length]. (If not all sequences in a batch have the same length you need to pad as necessary, and take care to compute the loss only over the non-padded positions.)
dynamic_rnn steps through your input over time for you, so you only need to loop over batches.
Since the second dimension of your idata is 1, you are right that your effective sequence length is 1.
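Concretely, here is a minimal sketch of the shapes involved, using the TF 1.x API from your question; the fixed seq_len, batch_size and vocab values are just illustrative:

import tensorflow as tf

batch_size, seq_len, vocab = 32, 50, 257

idata = tf.placeholder(tf.int32, [None, seq_len])   # [batch, time] byte ids
odata = tf.placeholder(tf.int32, [None, seq_len])   # next-byte targets, same shape
source = tf.one_hot(idata, vocab)                   # [batch, time, 257], float32

cell = tf.nn.rnn_cell.BasicLSTMCell(vocab)
# dynamic_rnn unrolls over the time (second) dimension internally,
# so the Python training loop only has to iterate over batches.
outputs, state = tf.nn.dynamic_rnn(cell, source, dtype=tf.float32)  # [batch, time, 257]

Each training step would then feed a [batch_size, seq_len] slice of the text rather than single characters.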
For a language model that is not character-based but uses word embeddings, take a look at this tutorial.
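For reference, an embedding-based input replaces the one-hot encoding with a learned lookup table. A minimal sketch, where vocab_size and embed_dim are hypothetical values:

vocab_size, embed_dim = 10000, 128
word_ids = tf.placeholder(tf.int32, [None, None])                  # [batch, time] integer ids
embedding = tf.get_variable("embedding", [vocab_size, embed_dim])  # learned lookup table
inputs = tf.nn.embedding_lookup(embedding, word_ids)               # [batch, time, embed_dim]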
Other notes:
If you want to experiment with a different number of units in the LSTM, consider adding a linear layer on top of the output to project each output (for batch entry i at time t) down to 257, which is the number of classes of your target.
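For example, a sketch with a wider LSTM and a linear projection back down to 257 classes; 512 units is just an illustrative choice, and tf.layers.dense applies the same projection weights at every time step:

cell = tf.nn.rnn_cell.BasicLSTMCell(512)                        # more units than classes
outputs, _ = tf.nn.dynamic_rnn(cell, source, dtype=tf.float32)  # [batch, time, 512]
logits = tf.layers.dense(outputs, 257)                          # project to [batch, time, 257]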
There is no need to do a one-hot encoding of the target. Take a look at tf.nn.sparse_softmax_cross_entropy_with_logits.
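Continuing the sketch above, the integer targets can be used directly, so the one-hot target tensor from the question becomes unnecessary:

# odata holds the integer ids of the next byte, shape [batch, time]
losses = tf.nn.sparse_softmax_cross_entropy_with_logits(
    labels=odata,   # int32 class ids, no one-hot needed
    logits=logits)  # [batch, time, 257] from the projection sketch above
cross_entropy = tf.reduce_mean(losses)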