I can't get TensorFlow ReLU activations (neither tf.nn.relu nor tf.nn.relu6) working without NaN values for activations and weights killing my training runs.
I believe I'm following all the right general advice. For example, I initialize my weights with
weights = tf.Variable(tf.truncated_normal(w_dims, stddev=0.1))
biases = tf.Variable(tf.constant(0.1 if neuron_fn in [tf.nn.relu, tf.nn.relu6] else 0.0, shape=b_dims))
and use a slow training rate, e.g.,
tf.train.MomentumOptimizer(0.02, momentum=0.5).minimize(cross_entropy_loss)
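For context, here is a minimal sketch of the kind of layer construction I'm describing (the helper name dense_layer and the arguments in_dim and out_dim are placeholders for illustration, not my actual code):

import tensorflow as tf

# Illustrative only: one fully connected layer built as described above.
def dense_layer(input_tensor, in_dim, out_dim, neuron_fn=tf.nn.relu):
    # Truncated-normal weights, small positive bias for ReLU-family activations
    weights = tf.Variable(tf.truncated_normal([in_dim, out_dim], stddev=0.1))
    biases = tf.Variable(tf.constant(
        0.1 if neuron_fn in [tf.nn.relu, tf.nn.relu6] else 0.0, shape=[out_dim]))
    return neuron_fn(tf.matmul(input_tensor, weights) + biases)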
But any network of appreciable depth results in NaN for the cost and for at least some of the weights (at least in their summary histograms). In fact, the cost is often NaN right from the start (before training).
I seem to have these issues even when I use L2 regularization (about 0.001) and dropout (about 50%).
Is there some parameter or setting that I should adjust to avoid these issues? I'm at a loss as to where to even begin looking, so any suggestions would be appreciated!
Following He et al. (as suggested in lejlot's comment), initializing the weights of the l-th layer to a zero-mean Gaussian distribution with standard deviation sqrt(2 / n_l), where n_l is the flattened length of the input vector, or, in TensorFlow,
stddev=np.sqrt(2 / np.prod(input_tensor.get_shape().as_list()[1:]))
results in weights that generally do not diverge.
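As a sketch, that initialization can be wrapped in a small helper like the following (the name he_init_weights and the use of tf.truncated_normal are my illustration, not code from the He et al. paper):

import numpy as np
import tensorflow as tf

# Illustrative helper: He initialization for a weight tensor of shape w_dims,
# given the tensor that feeds into this layer.
def he_init_weights(input_tensor, w_dims):
    # n_l: flattened length of the input to this layer (all dims except batch)
    n_l = np.prod(input_tensor.get_shape().as_list()[1:])
    return tf.Variable(tf.truncated_normal(w_dims, stddev=np.sqrt(2.0 / n_l)))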
If you use a softmax classifier at the top of your network, try to make the initial weights of the layer just below the softmax very small (e.g. std=1e-4). This makes the initial distribution of outputs of the network very soft (high temperature), and helps ensure that the first few steps of your optimization are not too large and numerically unstable.
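For illustration, that might look like the following (logits_dims and last_hidden are placeholder names for the final weight shape and the last hidden activation):

# Sketch: very small initial weights for the layer directly below the softmax,
# so the initial logits are near zero and the softmax output is close to uniform.
softmax_weights = tf.Variable(tf.truncated_normal(logits_dims, stddev=1e-4))
softmax_biases = tf.Variable(tf.zeros(logits_dims[-1:]))
logits = tf.matmul(last_hidden, softmax_weights) + softmax_biases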