
How does one debug NaN values in TensorFlow?

I was running TensorFlow and happened to have something yielding a NaN. I'd like to know where it comes from, but I do not know how to find out. The main issue is that in a "normal" procedural program I would just write a print statement just before the operation is executed. The issue with TensorFlow is that I cannot do that, because I first declare (or define) the graph, so adding print statements to the graph definition does not help. Are there any rules, advice, or heuristics for tracking down what might be causing the NaN?


In this case I know more precisely what line to look at because I have the following:

Delta_tilde = 2.0*tf.matmul(x,W) - tf.add(WW, XX) # note: this quantity should always be positive because it's a pair-wise Euclidean distance
Z = tf.sqrt(Delta_tilde)
Z = Transform(Z) # potentially some transform; currently it returns Z unchanged (the identity) for debugging
Z = tf.pow(Z, 2.0)
A = tf.exp(Z)

When this line is present, the result is NaN, as reported by my summary writers. Why is this? Is there a way to at least explore what value Z has after it has been square rooted?


For the specific example I posted, I tried tf.Print(0,[Z]), but with no success: it printed nothing. As in:

Delta_tilde = 2.0*tf.matmul(x,W) - tf.add(WW, XX) # note: this quantity should always be positive because it's a pair-wise Euclidean distance
Z = tf.sqrt(Delta_tilde)
tf.Print(0,[Z]) # <-------- TF PRINT STATEMENT
Z = Transform(Z) # potentially some transform; currently it returns Z unchanged (the identity) for debugging
Z = tf.pow(Z, 2.0)
A = tf.exp(Z)

I actually don't understand what tf.Print is supposed to do. Why does it need two arguments? If I want to print one tensor, why would I need to pass two? It seems bizarre to me.


I was looking at the function tf.add_check_numerics_ops(), but the documentation doesn't say how to use it (and it doesn't seem very helpful in general). Does anyone know how to use this?
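
From the docs, my best guess (untested, so it may well be wrong) is that it is meant to be used roughly like this, where check_op is run alongside the training step; train_step, x, y_ and the batches below just stand in for whatever the training setup is:

# untested guess: add check ops for every floating-point tensor in the graph
check_op = tf.add_check_numerics_ops()

with tf.Session() as sess:
    sess.run(tf.initialize_all_variables())
    # running check_op together with the train step should raise an
    # InvalidArgumentError as soon as any tensor contains a NaN or Inf
    sess.run([train_step, check_op], feed_dict={x: batch_xs, y_: batch_ys})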


Since I've had comments suggesting the data might be bad: I am using standard MNIST. Moreover, I am computing a quantity that should be positive (a pair-wise Euclidean distance) and then square rooting it, so I don't see how the data specifically would be the issue.

asked Aug 07 '16 by Charlie Parker



2 Answers

There are a couple of reasons WHY you can get a NaN result. Often it is because of too high a learning rate, but plenty of other causes are possible, for example corrupt data in your input queue or a log-of-0 calculation.
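
For instance, a minimal sketch of guarding the log-of-0 case (the epsilon value and the y_pred name are just illustrative, not taken from the question's code):

eps = 1e-10  # small constant keeping the log argument away from 0
safe_log = tf.log(tf.clip_by_value(y_pred, eps, 1.0))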

Anyhow, the debugging you describe cannot be done with a simple print statement, as that would only print the tensor's meta-information inside the graph and not any actual values.

However, if you use tf.Print as an op when building the graph, then when the graph gets executed the actual values are printed (and it IS a good exercise to watch these values to debug and understand the behavior of your net).

However, you are not using the print op quite correctly. It is an op, so you need to pass it a tensor and use the result tensor it returns later on in the executed graph; otherwise the op will never be executed and no printing occurs. Try this:

Z = tf.sqrt(Delta_tilde)
Z = tf.Print(Z, [Z], message="my Z-values:") # <-------- TF PRINT STATEMENT
Z = Transform(Z) # potentially some transform; currently it returns Z unchanged (the identity) for debugging
Z = tf.pow(Z, 2.0)
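
The printing then happens when the downstream graph is actually run, not when it is built. A minimal sketch of triggering it (assuming x is a placeholder and batch_xs is a hypothetical batch of input data, as in the question's setup):

with tf.Session() as sess:
    sess.run(tf.initialize_all_variables())
    # evaluating A forces the tf.Print node to run, so Z's values
    # show up on standard error with the "my Z-values:" message
    a_val = sess.run(A, feed_dict={x: batch_xs})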
answered Sep 28 '22 by Phillip Bock


I often find it much tougher to pinpoint where the NaNs and Infs occur than to fix the bug itself. As a complement to @scai's answer, I'd like to add a few points here:

The debug module, which you can import with:

from tensorflow.python import debug as tf_debug 

is much better than any print or assert.

You can add the debug functionality simply by wrapping your session like this:

sess = tf_debug.LocalCLIDebugWrapperSession(sess)
sess.add_tensor_filter("has_inf_or_nan", tf_debug.has_inf_or_nan)

You'll then be dropped into a command-line interface, where you can enter run -f has_inf_or_nan and lt -f has_inf_or_nan to find where the NaNs or Infs are. The first one listed is the first place where the catastrophe occurs, and from the tensor name you can trace its origin in your code.
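
Putting it together, a minimal sketch of the whole wrapping (train_op and feed_dict just stand in for whatever your model and input pipeline use):

import tensorflow as tf
from tensorflow.python import debug as tf_debug

sess = tf.Session()
sess = tf_debug.LocalCLIDebugWrapperSession(sess)
sess.add_tensor_filter("has_inf_or_nan", tf_debug.has_inf_or_nan)

# every sess.run() call now drops into the tfdbg CLI, where
# `run -f has_inf_or_nan` keeps running until the filter first fires
sess.run(train_op, feed_dict=feed_dict)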

Reference: https://developers.googleblog.com/2017/02/debug-tensorflow-models-with-tfdbg.html

answered Sep 28 '22 by Lerner Zhang