What's the difference between GradientTape, implicit_gradients, gradients_function and implicit_value_and_gradients?

Tags:

tensorflow

I'm trying to switch to TensorFlow eager mode and I find the documentation of GradientTape, implicit_gradients, gradients_function and implicit_value_and_gradients confusing.

What's the difference between them? When should I use one over the other?

The introductory documentation does not mention the implicit* functions at all, yet almost all of the examples in the TensorFlow repository seem to use that method for computing gradients.

Milad asked Apr 30 '18 10:04



1 Answer

There are 4 ways to automatically compute gradients when eager execution is enabled (actually, they also work in graph mode):

  • A tf.GradientTape context records computations so that you can call tape.gradient() to get the gradients of any tensor computed while recording, with respect to any trainable variable.
  • tfe.gradients_function() takes a function (say f()) and returns a gradient function (say fg()) that can compute the gradients of the outputs of f() with respect to the parameters of f() (or a subset of them).
  • tfe.implicit_gradients() is very similar, but fg() computes the gradients of the outputs of f() with respect to all trainable variables these outputs depend on.
  • tfe.implicit_value_and_gradients() is almost identical, but fg() also returns the output of the function f().

Usually, in machine learning, you will want to compute the gradients of the loss with respect to the model parameters (i.e., the variables), and you will generally also be interested in the value of the loss itself. For this use case, the simplest and most efficient options are tf.GradientTape and tfe.implicit_value_and_gradients() (the other two options do not give you the value of the loss itself, so if you need it, it will require extra computations). I personally prefer tfe.implicit_value_and_gradients() when writing production code, and tf.GradientTape when experimenting in a Jupyter notebook.

Edit: In TF 2.0, it seems that only tf.GradientTape remains. Maybe the other functions will be added back, but I wouldn't count on it.
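For reference, here is what the recommended value-plus-gradients pattern looks like with tf.GradientTape alone, a minimal sketch assuming TF 2.x (where eager execution is the default and no enable_eager_execution() call is needed):

```python
import tensorflow as tf  # assumes TF 2.x

w1 = tf.Variable(2.0)
w2 = tf.Variable(3.0)

def weighted_sum(x1, x2):
    return w1 * x1 + w2 * x2

# Record the forward pass; the loss value comes for free as the
# tensor computed inside the context.
with tf.GradientTape() as tape:
    s = weighted_sum(5., 7.)

[w1_grad, w2_grad] = tape.gradient(s, [w1, w2])
print(s.numpy())        # 31.0
print(w1_grad.numpy())  # 5.0 = x1
```

So in TF 2.x the single tape covers the same use case as tfe.implicit_value_and_gradients(): one forward pass gives you both the value and the gradients.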

Detailed example

Let's create a small function to highlight the differences:

import tensorflow as tf
import tensorflow.contrib.eager as tfe
tf.enable_eager_execution()

w1 = tfe.Variable(2.0)
w2 = tfe.Variable(3.0)

def weighted_sum(x1, x2):
    return w1 * x1 + w2 * x2

s = weighted_sum(5., 7.)
print(s.numpy()) # 31

Using tf.GradientTape

Within a GradientTape context, all operations are recorded, and you can then compute the gradients of any tensor computed within the context with respect to any trainable variable. For example, this code computes s within the GradientTape context, and then computes the gradient of s with respect to w1. Since s = w1 * x1 + w2 * x2, the gradient of s with respect to w1 is x1:

with tf.GradientTape() as tape:
    s = weighted_sum(5., 7.)

[w1_grad] = tape.gradient(s, [w1])
print(w1_grad.numpy()) # 5.0 = gradient of s with respect to w1 = x1
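As a quick sanity check that does not depend on TensorFlow at all, a finite-difference estimate agrees with what the tape reports: since s = w1 * x1 + w2 * x2, nudging w1 by a tiny eps changes s by roughly eps * x1 (the helper s_of_w1 below exists only for this illustration):

```python
# Finite-difference check of the analytic gradient, in plain Python.
# s = w1 * x1 + w2 * x2, so ds/dw1 should equal x1 = 5.0.
def s_of_w1(w1, w2=3.0, x1=5.0, x2=7.0):
    return w1 * x1 + w2 * x2

eps = 1e-6
numeric_grad = (s_of_w1(2.0 + eps) - s_of_w1(2.0)) / eps
print(round(numeric_grad, 3))  # 5.0, matching tape.gradient
```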

Using tfe.gradients_function()

This function returns another function that can compute the gradients of a function's returned value with respect to its parameters. For example, we can use it to define a function that will compute the gradients of s with respect to x1 and x2:

grad_fn = tfe.gradients_function(weighted_sum)
x1_grad, x2_grad = grad_fn(5., 7.)
print(x1_grad.numpy()) # 2.0 = gradient of s with respect to x1 = w1

In the context of optimization, it would make more sense to compute gradients with respect to variables that we can tweak. For this, we can change the weighted_sum() function to take w1 and w2 as parameters as well, and tell tfe.gradients_function() to only consider the parameters named "w1" and "w2":

def weighted_sum_with_weights(w1, x1, w2, x2):
    return w1 * x1 + w2 * x2

grad_fn = tfe.gradients_function(weighted_sum_with_weights, params=["w1", "w2"])
[w1_grad, w2_grad] = grad_fn(w1, 5., w2, 7.)
print(w2_grad.numpy()) # 7.0 = gradient of s with respect to w2 = x2

Using tfe.implicit_gradients()

This function returns another function that can compute the gradients of a function's returned value with respect to all trainable variables it depends on. Going back to the first version of weighted_sum(), we can use it to compute the gradients of s with respect to w1 and w2 without having to pass these variables explicitly. Note that the gradient function returns a list of gradient/variable pairs:

grad_fn = tfe.implicit_gradients(weighted_sum)
[(w1_grad, w1_var), (w2_grad, w2_var)] = grad_fn(5., 7.)
print(w1_grad.numpy()) # 5.0 = gradient of s with respect to w1 = x1

assert w1_var is w1
assert w2_var is w2

This function does seem like the simplest and most useful option, since generally we are interested in computing the gradients of the loss with respect to the model parameters (i.e., the variables). Note: try making w1 untrainable (w1 = tfe.Variable(2., trainable=False)) and redefining weighted_sum(), and you will see that grad_fn only returns the gradient of s with respect to w2.
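The same trainable-only filtering applies to tf.GradientTape, which also watches only trainable variables by default. A minimal sketch, assuming TF 2.x (where the tfe.* functions are gone): the tape simply returns None for the untrainable variable:

```python
import tensorflow as tf  # assumes TF 2.x

w1 = tf.Variable(2.0, trainable=False)  # not watched automatically
w2 = tf.Variable(3.0)

with tf.GradientTape() as tape:
    s = w1 * 5. + w2 * 7.

w1_grad, w2_grad = tape.gradient(s, [w1, w2])
print(w1_grad)          # None: w1 is untrainable, so it was not watched
print(w2_grad.numpy())  # 7.0
```

To get a gradient for w1 anyway, you would call tape.watch(w1) inside the context.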

Using tfe.implicit_value_and_gradients()

This function is almost identical to implicit_gradients() except the function it creates also returns the result of the function being differentiated (in this case weighted_sum()):

grad_fn = tfe.implicit_value_and_gradients(weighted_sum)
s, [(w1_grad, w1_var), (w2_grad, w2_var)] = grad_fn(5., 7.)
print(s.numpy()) # 31.0 = s = w1 * x1 + w2 * x2

When you need both the output of a function and its gradients, this function can give you a nice performance boost, since you get the output of the function for free when computing the gradients using autodiff.
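In TF 2.x, where implicit_value_and_gradients() is no longer available, the same value-plus-gradients training step can be sketched with tf.GradientTape and an optimizer (the SGD optimizer and the learning rate of 0.1 here are illustrative assumptions, not part of the original answer):

```python
import tensorflow as tf  # assumes TF 2.x

w1 = tf.Variable(2.0)
w2 = tf.Variable(3.0)
optimizer = tf.keras.optimizers.SGD(learning_rate=0.1)

# One training step: forward pass gives the loss value for free,
# the tape gives the gradients, and the optimizer applies them.
with tf.GradientTape() as tape:
    loss = w1 * 5. + w2 * 7.   # stand-in "loss" for illustration

grads = tape.gradient(loss, [w1, w2])
optimizer.apply_gradients(zip(grads, [w1, w2]))

print(loss.numpy())  # 31.0
print(w1.numpy())    # 1.5 = 2.0 - 0.1 * 5.0
```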

MiniQuark answered Oct 22 '22 02:10