Determinism in TensorFlow gradient updates?

So I have a very simple NN script written in TensorFlow, and I am having a hard time tracking down where some "randomness" is coming from.

I have recorded the

  • Weights,
  • Gradients,
  • Logits

of my network as I train, and for the first iteration, it is clear that everything starts off the same. I have a SEED value for how the data is read in, and a SEED value for initializing the weights of the net. Those I never change.

My problem is that on, say, the second iteration of every re-run I do, I start to see the gradients diverge (by a small amount, say 1e-6 or so). Over time, of course, this leads to non-repeatable behaviour.

What might the cause of this be? I don't know where any possible source of randomness might be coming from...

Thanks

Spacey asked Oct 08 '16 23:10


3 Answers

There's a good chance you could get deterministic results if you run your network on CPU (export CUDA_VISIBLE_DEVICES=), with a single thread in the Eigen thread pool (tf.Session(config=tf.ConfigProto(intra_op_parallelism_threads=1))), one Python thread (no multi-threaded queue runners, which you get from ops like tf.train.batch), and a single well-defined operation order. Setting inter_op_parallelism_threads=1 may also help in some scenarios.
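A minimal sketch of that setup, assuming the TF 1.x API that the calls above come from (under TF 2.x the same names should be reachable through tf.compat.v1). The function name make_single_threaded_session is made up for illustration, and TensorFlow is imported inside it so that the environment variable is already set by the time the import happens:

```python
import os

# Hide all GPUs from TensorFlow. This has to happen before TensorFlow is
# imported, which is why the import below is deferred into the function.
os.environ["CUDA_VISIBLE_DEVICES"] = ""


def make_single_threaded_session():
    """Return a TF 1.x session limited to one intra-op and one inter-op thread.

    Hypothetical helper: assumes TensorFlow 1.x is installed.
    """
    import tensorflow as tf  # deferred so CUDA_VISIBLE_DEVICES is already set

    config = tf.ConfigProto(
        intra_op_parallelism_threads=1,  # one thread inside each op (Eigen pool)
        inter_op_parallelism_threads=1,  # one op executed at a time
    )
    return tf.Session(config=config)
```

This only covers the device and thread-pool settings; the input pipeline would still need to avoid multi-threaded queue runners.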

One issue is that floating-point addition and multiplication are not associative, so the result depends on the order of operations. One fool-proof way to get deterministic results is therefore to use integer arithmetic or quantized values.
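To make the associativity point concrete, here is a small plain-Python illustration. The fixed-point scale and the to_fixed helper are arbitrary choices for the example, not anything TensorFlow does:

```python
a, b, c = 0.1, 0.2, 0.3

# The same three floats, added in two different groupings.
left_to_right = (a + b) + c
right_to_left = a + (b + c)
print(left_to_right == right_to_left)  # False

# Quantizing to integers (fixed point) makes addition exactly associative,
# at the cost of a rounding error when converting each input.
SCALE = 2 ** 16  # hypothetical fixed-point scale


def to_fixed(x):
    """Convert a float to a fixed-point integer with SCALE steps per unit."""
    return round(x * SCALE)


fa, fb, fc = to_fixed(a), to_fixed(b), to_fixed(c)
print((fa + fb) + fc == fa + (fb + fc))  # True
```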

Barring that, you could isolate which operation is non-deterministic and try to avoid using that op. For instance, the tf.add_n op doesn't say anything about the order in which it sums its inputs, and different orders produce different results.
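One way to narrow down the offending op is to record the same tensors on two runs and look for the first place they differ. The sketch below works on plain Python lists; the function name, the snapshot layout, and the numbers are all invented for illustration:

```python
def first_divergence(run_a, run_b):
    """Return (step, name) of the first recorded value that differs, else None.

    Each run is a list of per-step dicts mapping a tensor name to its values.
    Listing names in forward-pass order makes the first hit the most upstream one.
    """
    for step, (snap_a, snap_b) in enumerate(zip(run_a, run_b)):
        for name in snap_a:
            if snap_a[name] != snap_b[name]:
                return step, name
    return None


# Made-up recordings from two re-runs with identical seeds.
run_1 = [
    {"logits": [0.5, 1.5], "grad_w": [0.25, -0.75]},
    {"logits": [0.625, 1.25], "grad_w": [0.125, -0.5]},
]
run_2 = [
    {"logits": [0.5, 1.5], "grad_w": [0.25, -0.75]},
    {"logits": [0.625, 1.25], "grad_w": [0.125, -0.500001]},
]

print(first_divergence(run_1, run_2))  # (1, 'grad_w')
```

Recording more intermediate tensors between the logits and the gradients would shrink the set of candidate ops further.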

Getting deterministic results is a bit of an uphill battle because determinism is in conflict with performance, and performance is usually the goal that gets more attention. An alternative to trying to get the exact same numbers on reruns is to focus on numerical stability: if your algorithm is stable, then you will get reproducible results (i.e., the same number of misclassifications) even though exact parameter values may be slightly different.
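As a rough illustration of that kind of check, the following compares two runs by their misclassification count instead of by exact values. The helper names, labels, and logits are invented for the example:

```python
def argmax(values):
    """Index of the largest entry in a list of numbers."""
    return max(range(len(values)), key=lambda i: values[i])


def misclassifications(logits_batch, labels):
    """Count examples whose predicted class (argmax) differs from the label."""
    return sum(
        1 for logits, label in zip(logits_batch, labels) if argmax(logits) != label
    )


labels = [1, 0, 1]

# Made-up logits from two re-runs that differ by roughly 1e-6.
logits_run_1 = [[0.2, 0.9], [1.3, -0.4], [0.1, 0.05]]
logits_run_2 = [[0.200001, 0.899999], [1.299999, -0.400001], [0.100001, 0.049999]]

errors_1 = misclassifications(logits_run_1, labels)
errors_2 = misclassifications(logits_run_2, labels)
print(logits_run_1 == logits_run_2)  # False: the raw numbers differ
print(errors_1 == errors_2)          # True: the outcome is the same
```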

Yaroslav Bulatov answered Nov 17 '22 07:11


The TensorFlow reduce_sum op is specifically known to be non-deterministic. Furthermore, reduce_sum is used for calculating bias gradients.

This post discusses a workaround that avoids reduce_sum: taking the dot product of any vector with a vector of all 1's gives the same result as reduce_sum.
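The identity behind that workaround can be checked in plain Python. This sketch only demonstrates the arithmetic; whether the matrix-multiply form is actually deterministic on a given GPU or TensorFlow version is not something it establishes. The helper names and the upstream gradient values are made up:

```python
def column_sums(matrix):
    """Sum over the batch axis, i.e. what reduce_sum over axis 0 computes."""
    return [sum(column) for column in zip(*matrix)]


def ones_dot(matrix):
    """Multiply a row vector of ones by the matrix: ones(1 x N) . matrix(N x M)."""
    ones = [1.0] * len(matrix)
    n_cols = len(matrix[0])
    return [
        sum(ones[i] * matrix[i][j] for i in range(len(matrix)))
        for j in range(n_cols)
    ]


# Made-up upstream gradients for a batch of 3 examples and 2 bias units.
upstream_grad = [
    [0.5, -1.0],
    [0.25, 2.0],
    [1.0, 0.5],
]

print(column_sums(upstream_grad))  # [1.75, 1.5]
print(ones_dot(upstream_grad))     # [1.75, 1.5]
```

In TensorFlow terms this would correspond to replacing a tf.reduce_sum over the batch axis with a tf.matmul against a tensor of ones.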

DankMasterDan answered Nov 17 '22 06:11


I have faced the same problem. The working solution for me was to:

1- Use tf.set_random_seed(1) so that all tf functions use the same seed on every new run.

2- Train the model on the CPU rather than the GPU, to avoid non-deterministic GPU operations that introduce small precision differences.
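A sketch of the seeding step, assuming the TF 1.x API used in this answer. The function names are hypothetical, and TensorFlow is imported lazily so that the data-order part can run on its own:

```python
import random

SEED = 1


def seeded_shuffle(items, seed=SEED):
    """Shuffle a copy of items with a private, seeded generator.

    Stands in for seeding the order in which training data is read.
    """
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    return shuffled


def build_seeded_weights(seed=SEED):
    """Create a randomly initialized TF 1.x variable under a graph-level seed.

    Hypothetical helper: assumes TensorFlow 1.x is installed.
    """
    import tensorflow as tf

    tf.reset_default_graph()
    tf.set_random_seed(seed)  # set before any random op is created
    return tf.Variable(tf.random_normal([4, 2]), name="weights")


# The same seed gives the same data order on every call.
print(seeded_shuffle(range(10)) == seeded_shuffle(range(10)))  # True
```

Seeding fixes the initial weights and the data order; it does not by itself remove the run-to-run differences that come from non-deterministic ops.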

Mohamed Atef answered Nov 17 '22 06:11