 

TensorFlow float16 support is broken

Tags:

tensorflow

Recently I tried to train a CNN in TensorFlow using float16. To my surprise, it is broken in various ways even though TF has claimed to support it for a while. For example, float16 optimization produces a NaN loss as early as the second training step, regardless of the network.

import tensorflow as tf
import numpy as np

slim = tf.contrib.slim

dtype = tf.float16
shape = (4, 16, 16, 3)

# Zero-initialized convolution applied to a zero float16 input.
inpt = tf.placeholder(dtype, shape, name='input')
net = slim.conv2d(inpt, 16, [3, 3], scope='conv',
        weights_initializer=tf.zeros_initializer(),
        # normalizer_fn=slim.batch_norm
        )
loss = tf.reduce_mean(net)
opt = tf.train.AdamOptimizer(1e-3)
train_op = slim.learning.create_train_op(loss, opt)

# All-zero input: the loss should stay at zero, yet the second run prints NaN.
val = np.zeros(shape)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for i in range(2):
        print(sess.run(train_op, feed_dict={inpt: val}))

To my understanding this is clearly a bug: I apply zero-initialized convolutions to zero input, so I should get zero gradients and the loss should stay at zero. It simply cannot diverge. With dtype float32 it works. The NaN loss occurs on both the CPU and GPU versions.

However, I was dismissed in the GitHub issues: a random dude closed the issue, saying it is intended behaviour: https://github.com/tensorflow/tensorflow/issues/7226

If you uncomment the line with batch norm, it breaks already at graph construction time, because BN assumes its moving averages (and beta, gamma) are always float32 and does not cast them properly. That issue was also closed and apparently ignored: https://github.com/tensorflow/tensorflow/issues/7164
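The only workaround I can think of (just an idea, not a proper fix) is to do the normalization itself in float32 and cast back, so that BN's float32 variables match what they are applied to:

def batch_norm_fp32(x, **kwargs):
    # Cast up to float32 so BN's float32 moving averages, beta and gamma
    # match their inputs, then cast the result back to the input dtype.
    y = slim.batch_norm(tf.cast(x, tf.float32), **kwargs)
    return tf.cast(y, x.dtype)

net = slim.conv2d(inpt, 16, [3, 3], scope='conv',
        weights_initializer=tf.zeros_initializer(),
        normalizer_fn=batch_norm_fp32)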

I feel like I am talking to first-line IT support at an ISP.

Can anybody explain how I should train with float16 when such a simple "network" fails horribly? And what is the recommended way to report bugs now?

asked Feb 06 '17 by Konstantin Shmelkov


1 Answer

It looks like you need a slightly larger epsilon to avoid numerical instability with zero moments in AdamOptimizer (default is 1e-8). This works for me with float16:

opt = tf.train.AdamOptimizer(1e-3, epsilon=1e-4)
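My guess as to why the default fails: 1e-8 is below the smallest positive float16 value, so once epsilon is cast to the variable dtype it becomes zero and Adam's update ends up dividing zero by zero. The underflow itself is easy to check with NumPy:

import numpy as np

# 1e-8 is smaller than the smallest positive float16 (~6e-8) and rounds to zero;
# 1e-4 is still representable.
print(np.float16(1e-8))   # 0.0
print(np.float16(1e-4))   # ~0.0001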

It would be reasonable to request that epsilon be set based on dtype (and presumably such a request, or better yet a pull request, would be met with a more positive response on GitHub). Note that GradientDescentOptimizer has no such issue.
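In the meantime, a simple stopgap (nothing official, just picking epsilon by dtype yourself, reusing the dtype variable from the question's snippet) would be:

eps = 1e-4 if dtype == tf.float16 else 1e-8
opt = tf.train.AdamOptimizer(1e-3, epsilon=eps)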

answered Oct 20 '22 by Allen Lavoie