Here is the link to the TF optimizer class: https://www.tensorflow.org/versions/r0.12/api_docs/python/train/optimizers
GATE_NONE: Take the simple case of a matmul op on two vectors 'x' and 'y', and let the output be L. The gradient of L w.r.t. x is y, and the gradient of L w.r.t. y is x^T (x transpose). With GATE_NONE it could happen that the gradient w.r.t. x is applied to modify x before the gradient w.r.t. y is even calculated. When the gradient w.r.t. y is then computed, it would be based on the already-modified x, which is an error. Of course this won't happen in such a simple case, but you can imagine it happening in more complex/extreme graphs.
GATE_OP: For each op, make sure all of its gradients are computed before any of them are used. This prevents race conditions for ops that generate gradients for multiple inputs where the gradients depend on those inputs. (You can see how this avoids the GATE_NONE problem above, though at the price of some parallelism.)
GATE_GRAPH: Make sure all gradients for all variables are computed before any one of them is used. This provides the least parallelism but can be useful if you want to process all gradients together before applying any of them (an example use case is clipping gradients according to their global norm before applying them, as sketched below).
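Here is a minimal sketch of that GATE_GRAPH use case, written against the TF 1.x-style API from the linked page; the variable values, learning rate, and clip norm are made-up for illustration:

```python
import tensorflow as tf

# Toy graph: L = x . y, so dL/dx = y and dL/dy = x (values are arbitrary).
x = tf.Variable([1.0, 2.0], name="x")
y = tf.Variable([3.0, 4.0], name="y")
loss = tf.reduce_sum(x * y)

optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.1)

# GATE_GRAPH: every gradient is computed before any of them is used,
# so it is safe to look at all of them jointly, e.g. to clip by global norm.
grads_and_vars = optimizer.compute_gradients(
    loss, gate_gradients=tf.train.Optimizer.GATE_GRAPH)
grads, variables = zip(*grads_and_vars)
clipped_grads, _ = tf.clip_by_global_norm(list(grads), clip_norm=5.0)
train_op = optimizer.apply_gradients(zip(clipped_grads, variables))
```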
On the same page you linked, if you scroll down a little, it says:
gate_gradients argument that controls the degree of parallelism during the application of the gradients
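For example (reusing the loss and optimizer from the sketch above), you can pass that argument directly to minimize():

```python
# GATE_OP is the default; GATE_NONE and GATE_GRAPH trade parallelism for safety.
train_op = optimizer.minimize(loss, gate_gradients=tf.train.Optimizer.GATE_OP)
```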