Recently I tried to train a CNN in TF using float16. To my surprise, it is broken in various ways, even though TF has claimed to support it for a while. For example, float16 optimization produces a NaN loss as early as the second step, regardless of the network.
import tensorflow as tf
import numpy as np
slim = tf.contrib.slim
dtype = tf.float16
shape = (4, 16, 16, 3)
inpt = tf.placeholder(dtype, shape, name='input')
net = slim.conv2d(inpt, 16, [3, 3], scope='conv',
        weights_initializer=tf.zeros_initializer(),
        # normalizer_fn=slim.batch_norm
        )
loss = tf.reduce_mean(net)
opt = tf.train.AdamOptimizer(1e-3)
train_op = slim.learning.create_train_op(loss, opt)
val = np.zeros(shape)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for i in range(2):
        print(sess.run(train_op, feed_dict={inpt: val}))
To my understanding this is clearly a bug: I apply a convolution with zero weights to a zero input, so I should get zero gradients that leave the zero loss unchanged. It simply can't diverge. With dtype set to float32 it works. The NaN loss occurs with both the CPU and GPU versions.
However, I was dismissed in the GH issues. A random dude closed the issue, saying that this is intended behaviour: https://github.com/tensorflow/tensorflow/issues/7226
If you uncomment the line with BN, it breaks already at graph construction time, because BN assumes the moving averages (and beta, gamma) are always float32 and does not cast them properly. That issue was also closed and apparently ignored: https://github.com/tensorflow/tensorflow/issues/7164
I feel like I am talking to the first-line IT support of an ISP.
Can anybody explain how I should train with float16 when such a simple "network" fails horribly? And what is the recommended way to report bugs now?
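It looks like you need a slightly larger epsilon to avoid numerical instability with zero moments in AdamOptimizer. The default, 1e-8, is below the smallest positive value float16 can represent (about 6e-8), so it rounds to zero in float16. This works for me with float16: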
opt = tf.train.AdamOptimizer(1e-3, epsilon=1e-4)
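To see why the default fails, here is a minimal NumPy sketch. It is not TensorFlow's actual kernel: it assumes the Adam step has the form m / (sqrt(v) + epsilon), as described in the AdamOptimizer documentation, and the function name adam_direction is made up for illustration.

```python
import numpy as np

def adam_direction(grad, epsilon, dtype):
    """Simplified first Adam step from zero-initialized moments.

    Every value is cast to `dtype` so the arithmetic mimics training
    in that precision. Illustrative only, not TensorFlow's code.
    """
    g = dtype(grad)
    eps = dtype(epsilon)                 # 1e-8 becomes 0.0 in float16
    beta1, beta2 = dtype(0.9), dtype(0.999)
    m = (dtype(1) - beta1) * g           # first moment estimate
    v = (dtype(1) - beta2) * g * g       # second moment estimate
    # Silence the "invalid value" warning NumPy raises for 0 / 0.
    with np.errstate(invalid="ignore", divide="ignore"):
        return m / (np.sqrt(v) + eps)

print(np.float16(1e-8))                        # 0.0
print(adam_direction(0.0, 1e-8, np.float16))   # nan (0 / 0)
print(adam_direction(0.0, 1e-8, np.float32))   # 0.0
print(adam_direction(0.0, 1e-4, np.float16))   # 0.0
```

With a zero gradient both moments stay at zero, so in float16 the denominator is exactly zero and the update is 0 / 0. Any epsilon that survives the cast to float16 keeps the denominator positive.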
It would be reasonable to request that epsilon be set based on dtype (and presumably such a request, or better yet a pull request, would be met with a more positive response on GitHub). Note that GradientDescentOptimizer has no such issue.
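One way such a dtype-dependent default could look is sketched below. The helper epsilon_for is hypothetical and is not part of TensorFlow. It assumes a reasonable rule is "keep 1e-8 unless the dtype cannot represent it as a normal number, in which case fall back to the dtype's smallest positive normal value".

```python
import numpy as np

def epsilon_for(dtype, floor=1e-8):
    """Hypothetical helper: pick an Adam epsilon that survives a cast to `dtype`.

    np.finfo(dtype).tiny is the smallest positive normal number of the
    dtype, so the result never underflows to zero in that precision.
    """
    return max(floor, float(np.finfo(dtype).tiny))

eps16 = epsilon_for(np.float16)   # about 6.1e-05, the float16 normal minimum
eps32 = epsilon_for(np.float32)   # 1e-08, the usual default is kept
print(eps16, eps32)
```

For float16 this lands in the same range as the epsilon=1e-4 used above. Whether that value is also a good choice for convergence is a separate question this sketch does not address.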