
TensorFlow issue with softmax

I have a TensorFlow multiclass classifier that is generating nan or inf while computing probabilities using tf.nn.softmax. See the following snippet (logits has shape batch_size x 6, since I have 6 classes and the output is one-hot encoded); batch_size is 1024.

logits = tf.debugging.check_numerics(logits, message='bad logits', name=None)
probabilities = tf.nn.softmax(logits=logits, name='Softmax')
probabilities = tf.debugging.check_numerics(probabilities, message='bad probabilities', name=None)

The classifier fails on the last statement, as it finds nan or inf in probabilities. The logits are clean; otherwise the first statement would have failed.
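
(For context, here is a minimal, made-up sketch of what check_numerics does when it hits a bad value; the tensor and message below are hypothetical:)

import numpy as np
import tensorflow as tf

# Made-up tensor containing a nan: check_numerics raises InvalidArgumentError at
# run time as soon as the guarded tensor contains nan or inf, prefixed with the message.
with tf.Session() as s:
    bad = tf.debugging.check_numerics(tf.constant([1.0, np.nan]), message='bad tensor')
    try:
        s.run(bad)
    except tf.errors.InvalidArgumentError as e:
        print(e.message)  # something like: bad tensor : Tensor had NaN values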

From what I read about tf.nn.softmax, it can handle very large and very small values in logits. I have verified this in interactive mode.

>>> with tf.Session() as s:
...   a = tf.constant([[1000, 10], [-100, -200], [3, 4.0]])
...   sm = tf.nn.softmax(logits=a, name='Softmax')
...   print(a.eval())
...   print(sm.eval())
...
[[1000.   10.]
 [-100. -200.]
 [   3.    4.]]
[[1.         0.        ]
 [1.         0.        ]
 [0.26894143 0.7310586 ]]

I then tried clipping the values in logits and the whole thing now works. See the modified snippet below.

logits = tf.debugging.check_numerics(logits, message='logits', name=None)
safe_logits = tf.clip_by_value(logits, -15.0, 15.0)
probabilities = tf.nn.softmax(logits=safe_logits, name='Softmax')
probabilities = tf.debugging.check_numerics(probabilities, message='bad probabilities', name=None)

In the second statement, I am clipping the values in logits to between -15 and 15, and that somehow prevents nan/inf in the softmax computation. So I was able to fix the issue at hand.

However, I still don't understand why this clipping works. (I should mention that clipping between -20 and 20 does not work, and the model still fails with nan or inf in probabilities.)

Could someone help me understand why this is the case?

I am using TensorFlow 1.15.0, running on a 64-bit instance.

asked Aug 30 '21 by Nik


People also ask

What does TensorFlow softmax do?

Softmax is often used as the activation for the last layer of a classification network because the result can be interpreted as a probability distribution. The softmax of each vector x is computed as exp(x) / tf.reduce_sum(exp(x)). The input values are the log-odds of the resulting probabilities.
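
A minimal sketch of that identity, with made-up logits and assuming an eager (TF 2.x-style) runtime:

import tensorflow as tf

# Made-up logits: softmax(x) is exp(x) normalized by the sum of exp(x) along the last axis.
x = tf.constant([[2.0, 1.0, 0.1]])
manual = tf.exp(x) / tf.reduce_sum(tf.exp(x), axis=-1, keepdims=True)
builtin = tf.nn.softmax(x)
print(manual.numpy())   # approx. [[0.659 0.242 0.099]]
print(builtin.numpy())  # same values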

Which TensorFlow function can you use to calculate the softmax?

We compute the softmax and cross-entropy using tf.nn.softmax_cross_entropy_with_logits (it's one operation in TensorFlow, because it's very common, and it can be optimized).
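
A minimal usage sketch, with made-up one-hot labels and logits and assuming the TF 2.x-style labels=/logits= keywords:

import tensorflow as tf

# Made-up one-hot labels and raw logits: the fused op applies softmax to the logits
# internally and returns the per-example cross-entropy.
labels = tf.constant([[0.0, 1.0, 0.0]])
logits = tf.constant([[2.0, 1.0, 0.1]])
loss = tf.nn.softmax_cross_entropy_with_logits(labels=labels, logits=logits)
print(loss.numpy())  # approx. [1.417]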

What is dim softmax?

Softmax(dim=None) applies the Softmax function to an n-dimensional input Tensor, rescaling it so that the elements of the n-dimensional output Tensor lie in the range [0, 1] and sum to 1.
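
In TensorFlow the equivalent knob is the axis argument of tf.nn.softmax; a quick sketch with made-up values:

import tensorflow as tf

# Made-up values: axis selects which dimension is normalized, so the slices
# along that axis each sum to 1.
x = tf.constant([[1.0, 2.0], [3.0, 4.0]])
print(tf.nn.softmax(x, axis=-1).numpy())  # each row sums to 1
print(tf.nn.softmax(x, axis=0).numpy())   # each column sums to 1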




1 Answer

The first place to look is the values themselves, which you have already done. The second place to look is the gradients. Even if a value appears reasonable, if its gradient is very steep, backprop will eventually blow up both the gradient and the value.

For example, if the logits are generated by something like log(x), an x of 0.001 will generate -6.9. Looks pretty benign. But the gradient of log(x) is 1/x, which at x = 0.001 is 1000! That would quickly explode the gradients and values during backprop / forward prop.

# Pretend this is the source value that is fed to a function that generates the logit. 
>>> x = tf.Variable(0.001)

# Let's operate on the source value to generate the logit. 
>>> with tf.GradientTape() as tape:
...   y = tf.math.log(x)
... 

# The logit looks okay... -6.9. 
>>> y
<tf.Tensor: shape=(), dtype=float32, numpy=-6.9077554>

# But the gradient is exploding. 
>>> tape.gradient(y,x)
<tf.Tensor: shape=(), dtype=float32, numpy=999.99994>
>>> 

Clipping the logits would appear to be about feeding smaller values to softmax, but that's probably not why it helps. (In fact, softmax can handle a logit with the value tf.float32.max without a problem, so it's really unlikely that the value of the logit is the issue.) What may really be happening is that when you clip to 15, you are also setting the gradient to zero wherever the logit would otherwise be 20 with an explosive gradient. So clipping the value also clips the gradient.
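
A quick eager-mode check of that parenthetical claim (made-up two-class logits): tf.nn.softmax subtracts the per-row max internally, so even tf.float32.max produces a finite distribution.

>>> tf.nn.softmax([[tf.float32.max, 0.0]])
<tf.Tensor: shape=(1, 2), dtype=float32, numpy=array([[1., 0.]], dtype=float32)>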

# This is the same source variable as above. 
>>> x = tf.Variable(0.001)

# Now let's operate with clipping. 
>>> with tf.GradientTape() as tape:
...   y = tf.clip_by_value(tf.math.log(x), -1., 1.)
... 

# The clipped logit still looks okay... 
>>> y
<tf.Tensor: shape=(), dtype=float32, numpy=-1.0>

# What may be more important is that the clipping has also zeroed out the gradient
>>> tape.gradient(y,x)
<tf.Tensor: shape=(), dtype=float32, numpy=0.0>
answered Oct 22 '22 by Yaoshiang