Debugging NaNs in the backward pass

Tags: tensorflow

I'm trying to debug a somewhat complicated and non-canonical NN architecture. The forward pass computes fine and gives me the expected results, but when I try to optimize using Adam or any of the standard optimizers, I get NaNs everywhere after even one iteration with a very small learning rate. I'm trying to localize them: is there a way to catch the first occurrence of a NaN and detect the op in which it arose? I tried tf.add_check_numerics_ops(), but it doesn't appear to be doing anything, or perhaps I'm using it incorrectly.

asked Dec 02 '15 by Mohammed AlQuraishi

2 Answers

Debugging NaNs can be tricky, especially if you have a large network. tf.add_check_numerics_ops() adds ops to the graph that assert that each floating-point tensor in the graph contains no NaN values, but it does not run these checks by default. Instead, it returns a single op that you can run periodically, or on every step, as follows:

train_op = ...
check_op = tf.add_check_numerics_ops()

sess = tf.Session()
sess.run([train_op, check_op])  # Runs training and checks for NaNs
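
If it helps to see the failure mode end to end, here is a minimal, self-contained sketch (assuming TF 1.x graph mode, as in the snippet above; the op names like 'bad_div' are made up for illustration) in which the check op catches a deliberately introduced NaN and reports which op produced it:

import tensorflow as tf

# Deliberately produce a NaN so the check op has something to catch.
x = tf.constant(0.0, name='x')
bad = tf.divide(x, x, name='bad_div')   # 0 / 0 -> NaN
loss = tf.reduce_sum(bad, name='loss')

# Must be called after the graph is built; asserts every float tensor is finite.
check_op = tf.add_check_numerics_ops()

with tf.Session() as sess:
    try:
        sess.run([loss, check_op])
    except tf.errors.InvalidArgumentError as e:
        # The error message names the first tensor that contained NaN/Inf,
        # which localizes the offending op.
        print(e.message)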
answered Oct 22 '22 by mrry

Maybe you could add tf.Print ops to the suspect ops to print their values, something like this:

print_ops = []
for op in ops:  # `ops` is your list of suspect tensors
    print_ops.append(tf.Print(op, [op],
                              message='%s :' % op.name,
                              summarize=10))
print_op = tf.group(*print_ops)
sess.run([train_op, print_op])

To add checks to all ops, you could write a loop along the lines of the implementation of add_check_numerics_ops, as sketched below.
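
A hedged sketch of that loop (assuming TF 1.x; it mirrors the structure of add_check_numerics_ops by attaching tf.check_numerics to every floating-point tensor in the default graph):

import tensorflow as tf

check_ops = []
for op in tf.get_default_graph().get_operations():
    for output in op.outputs:
        # Only floating-point tensors can contain NaN/Inf.
        if output.dtype in (tf.float16, tf.float32, tf.float64):
            check_ops.append(
                tf.check_numerics(output, message='NaN/Inf in %s ' % op.name))
check_op = tf.group(*check_ops)
# Run alongside training: sess.run([train_op, check_op])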

answered Oct 22 '22 by Yaroslav Bulatov