
TensorFlow issue with softmax

I have a TensorFlow multiclass classifier that is generating nan or inf while computing probabilities using tf.nn.softmax. See the following snippet (logits has shape batch_size x 6, since I have 6 classes and the output is one-hot encoded); batch_size is 1024.

logits = tf.debugging.check_numerics(logits, message='bad logits', name=None)
probabilities = tf.nn.softmax(logits=logits, name='Softmax')
probabilities = tf.debugging.check_numerics(probabilities, message='bad probabilities', name=None)

The classifier fails on the last statement, as it finds nan or inf in probabilities. The logits are clean; otherwise the first statement would have failed.
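
(For context, here is a minimal, made-up sketch of what check_numerics does when it hits a bad value; the tensor and message below are hypothetical:)

import numpy as np
import tensorflow as tf

# Made-up tensor containing a nan: check_numerics raises InvalidArgumentError at
# run time as soon as the guarded tensor contains nan or inf, prefixed with the message.
with tf.Session() as s:
    bad = tf.debugging.check_numerics(tf.constant([1.0, np.nan]), message='bad tensor')
    try:
        s.run(bad)
    except tf.errors.InvalidArgumentError as e:
        print(e.message)  # something like: bad tensor : Tensor had NaN values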

From what I read about tf.nn.softmax, it can handle very large and very small values in logits. I have verified this in interactive mode.

>>> with tf.Session() as s:
...   a = tf.constant([[1000, 10], [-100, -200], [3, 4.0]])
...   sm = tf.nn.softmax(logits=a, name='Softmax')
...   print(a.eval())
...   print(sm.eval())
...
[[1000.   10.]
 [-100. -200.]
 [   3.    4.]]
[[1.         0.        ]
 [1.         0.        ]
 [0.26894143 0.7310586 ]]

I then tried clipping the values in logits and the whole thing now works. See the modified snippet below.

logits = tf.debugging.check_numerics(logits, message='logits', name=None)
safe_logits = tf.clip_by_value(logits, -15.0, 15.0)
probabilities = tf.nn.softmax(logits=safe_logits, name='Softmax')
probabilities = tf.debugging.check_numerics(probabilities, message='bad probabilities', name=None)

In the second statement, I am clipping the values in logits to between -15 and 15, and that somehow prevents nan/inf in the softmax computation. So I was able to fix the issue at hand.

However, I still don't understand why this clipping works. (I should mention that clipping between -20 and 20 does not work, and the model still fails with nan or inf in probabilities.)

Could someone help me understand why this is the case?

I am using TensorFlow 1.15.0, running on a 64-bit instance.

asked Aug 30 '21 by Nik


People also ask

What does TensorFlow softmax do?

Softmax is often used as the activation for the last layer of a classification network because the result can be interpreted as a probability distribution. The softmax of each vector x is computed as exp(x) / tf.reduce_sum(exp(x)). The input values are the log-odds of the resulting probabilities.
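
A minimal sketch of that identity, with made-up logits and assuming an eager (TF 2.x-style) runtime:

import tensorflow as tf

# Made-up logits: softmax(x) is exp(x) normalized by the sum of exp(x) along the last axis.
x = tf.constant([[2.0, 1.0, 0.1]])
manual = tf.exp(x) / tf.reduce_sum(tf.exp(x), axis=-1, keepdims=True)
builtin = tf.nn.softmax(x)
print(manual.numpy())   # approx. [[0.659 0.242 0.099]]
print(builtin.numpy())  # same values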

Which TensorFlow function can you use to calculate the softmax?

We compute the softmax and cross-entropy using tf.nn.softmax_cross_entropy_with_logits (it's one operation in TensorFlow, because it's very common, and it can be optimized).
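
A minimal usage sketch, with made-up one-hot labels and logits and assuming the TF 2.x-style labels=/logits= keywords:

import tensorflow as tf

# Made-up one-hot labels and raw logits: the fused op applies softmax to the logits
# internally and returns the per-example cross-entropy.
labels = tf.constant([[0.0, 1.0, 0.0]])
logits = tf.constant([[2.0, 1.0, 0.1]])
loss = tf.nn.softmax_cross_entropy_with_logits(labels=labels, logits=logits)
print(loss.numpy())  # approx. [1.417]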

What is dim softmax?

Softmax(dim=None) applies the Softmax function to an n-dimensional input Tensor, rescaling it so that the elements of the n-dimensional output Tensor lie in the range [0, 1] and sum to 1.
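
In TensorFlow the equivalent knob is the axis argument of tf.nn.softmax; a quick sketch with made-up values:

import tensorflow as tf

# Made-up values: axis selects which dimension is normalized, so the slices
# along that axis each sum to 1.
x = tf.constant([[1.0, 2.0], [3.0, 4.0]])
print(tf.nn.softmax(x, axis=-1).numpy())  # each row sums to 1
print(tf.nn.softmax(x, axis=0).numpy())   # each column sums to 1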




1 Answer

The first place to look is the values themselves, which you have already done. The second place to look is the gradients. Even if a value appears reasonable, if its gradient is very steep, backprop will eventually blow up both the gradient and the value.

For example, if the logits are generated by something like log(x), an x of 0.001 will generate -6.9. Looks pretty benign. But the gradient of log(x) is 1/x, which at x = 0.001 is 1000! That would quickly explode the gradients and values during backprop / forward prop.

# Pretend this is the source value that is fed to a function that generates the logit. 
>>> x = tf.Variable(0.001)

# Let's operate on the source value to generate the logit. 
>>> with tf.GradientTape() as tape:
...   y = tf.math.log(x)
... 

# The logit looks okay... -6.9. 
>>> y
<tf.Tensor: shape=(), dtype=float32, numpy=-6.9077554>

# But the gradient is exploding. 
>>> tape.gradient(y,x)
<tf.Tensor: shape=(), dtype=float32, numpy=999.99994>
>>> 

Clipping the logits would appear to be about feeding smaller values to softmax, but that's probably not why it helps. (In fact, softmax can handle a logit with the value tf.float32.max without a problem, so it's really unlikely that the value of the logit is the issue.) What may really be happening is that when you clip to 15, you are also setting the gradient to zero wherever the logit would otherwise be 20 with an explosive gradient. So clipping the value also clips the gradient.
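
A quick eager-mode check of that parenthetical claim (made-up two-class logits): tf.nn.softmax subtracts the per-row max internally, so even tf.float32.max produces a finite distribution.

>>> tf.nn.softmax([[tf.float32.max, 0.0]])
<tf.Tensor: shape=(1, 2), dtype=float32, numpy=array([[1., 0.]], dtype=float32)>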

# This is the same source variable as above. 
>>> x = tf.Variable(0.001)

# Now let's operate with clipping. 
>>> with tf.GradientTape() as tape:
...   y = tf.clip_by_value(tf.math.log(x), -1., 1.)
... 

# The clipped logit still looks okay... 
>>> y
<tf.Tensor: shape=(), dtype=float32, numpy=-1.0>

# What may be more important is that the clipping has also zeroed out the gradient
>>> tape.gradient(y,x)
<tf.Tensor: shape=(), dtype=float32, numpy=0.0>
answered Oct 22 '22 by Yaoshiang