I understand that autograd is used to refer to automatic differentiation. But what exactly is tape-based autograd in PyTorch, and why are there so many discussions that affirm or deny it?
For example:
this
In pytorch, there is no traditional sense of tape
and this
We don’t really build gradient tapes per se. But graphs.
but not this
Autograd is now a core torch package for automatic differentiation. It uses a tape based system for automatic differentiation.
And for further reference, please compare it with GradientTape in TensorFlow.
There are different types of automatic differentiation, e.g. forward-mode, reverse-mode, and hybrids (more explanation). The tape-based autograd in PyTorch simply refers to the use of reverse-mode automatic differentiation (source). Reverse-mode auto diff is simply a technique used to compute gradients efficiently, and it happens to be the technique used by backpropagation (source).
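As a tiny illustration (a minimal sketch in PyTorch; the tensor names are arbitrary), reverse mode is attractive for machine learning because a single backward pass yields the gradient of one scalar output (the loss) with respect to every input at once:

import torch

x = torch.randn(5, requires_grad=True)  # many inputs
loss = (x ** 2).sum()                    # one scalar output, e.g. a loss
loss.backward()                          # a single reverse sweep
print(x.grad)                            # gradient w.r.t. every element: 2 * x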
Now, in PyTorch, Autograd is the core torch package for automatic differentiation. It uses a tape-based system: in the forward phase, the autograd tape will remember all the operations it executed, and in the backward phase, it will replay those operations.
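For instance (a minimal sketch; the tensor names are arbitrary), each intermediate result of the forward pass carries a grad_fn node recording the operation that produced it, and calling backward() replays those recorded operations in reverse:

import torch

a = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(3.0, requires_grad=True)
c = a * b              # recorded: MulBackward0
d = c + a              # recorded: AddBackward0
print(c.grad_fn)       # <MulBackward0 object at ...>
print(d.grad_fn)       # <AddBackward0 object at ...>
d.backward()           # replay the recorded operations in reverse
print(a.grad, b.grad)  # d(d)/da = b + 1 = 4.0, d(d)/db = a = 2.0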
It is the same in TensorFlow: to differentiate automatically, it also needs to remember what operations happened, and in what order, during the forward pass; then, during the backward pass, it traverses this list of operations in reverse order to compute gradients. TensorFlow provides the tf.GradientTape API for automatic differentiation, that is, for computing the gradient of a computation with respect to some inputs, usually tf.Variables. TensorFlow records the relevant operations executed inside the context of a tf.GradientTape onto a tape and then uses that tape to compute the gradients of the recorded computation using reverse-mode differentiation.
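A minimal sketch of that behaviour (the variable names are arbitrary):

import tensorflow as tf

w = tf.Variable(3.0)
with tf.GradientTape() as tape:
    y = w * w                    # recorded on the tape
dy_dw = tape.gradient(y, w)      # reverse-mode sweep over the tape
print(dy_dw)                     # tf.Tensor(6.0, ...), i.e. 2 * w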
So, seen from a high-level viewpoint, both are doing the same thing. However, in a custom training loop, the forward pass and the loss calculation are more explicit in TensorFlow, because they happen inside the tf.GradientTape API scope, whereas in PyTorch these operations are implicit, but gradient tracking has to be disabled temporarily while updating the training parameters (weights and biases); for that, PyTorch uses the torch.no_grad API explicitly. In other words, computing gradients with tape.gradient() in TensorFlow corresponds to calling loss.backward() in PyTorch. Below is a simplistic form of the above statements in code.
# TensorFlow
[w, b] = tf_model.trainable_variables
for epoch in range(epochs):
    with tf.GradientTape() as tape:
        # forward pass and loss calculation
        # within the explicit tape scope
        predictions = tf_model(x)
        loss = squared_error(predictions, y)
    # compute gradients (grad)
    w_grad, b_grad = tape.gradient(loss, tf_model.trainable_variables)
    # update training variables
    w.assign(w - w_grad * learning_rate)
    b.assign(b - b_grad * learning_rate)
# PyTorch
[w, b] = torch_model.parameters()
for epoch in range(epochs):
    # forward pass and loss calculation
    # implicit tape-based AD
    y_pred = torch_model(inputs)
    loss = squared_error(y_pred, labels)
    # compute gradients (grad)
    loss.backward()
    # update training variables / parameters
    with torch.no_grad():
        w -= w.grad * learning_rate
        b -= b.grad * learning_rate
        w.grad.zero_()
        b.grad.zero_()
FYI, in the above, the trainable variables (w, b) are updated manually in both frameworks, but in practice we would generally use an optimizer (e.g. Adam) to do that job.
# TensorFlow
# ....
# update training variables
optimizer.apply_gradients(zip([w_grad, b_grad], tf_model.trainable_variables))
# PyTorch
# ....
# update training variables / parameters
optimizer.step()
optimizer.zero_grad()
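For completeness, here is a sketch of how the full PyTorch loop above would look with an optimizer doing the update (torch.optim.Adam is assumed here; torch_model, squared_error, inputs, labels, epochs and learning_rate are the same placeholders as before):

# PyTorch with an optimizer instead of manual updates
optimizer = torch.optim.Adam(torch_model.parameters(), lr=learning_rate)
for epoch in range(epochs):
    y_pred = torch_model(inputs)
    loss = squared_error(y_pred, labels)
    optimizer.zero_grad()   # clear gradients from the previous step
    loss.backward()         # fill .grad on each parameter
    optimizer.step()        # update parameters from their .grad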
I suspect this comes from two different uses of the word 'tape' in the context of automatic differentiation.
When people say that PyTorch is not tape-based, they mean it uses operator overloading (OO), as opposed to [tape-based] source transformation (ST), for automatic differentiation.
[Operator overloading] relies on a language’s ability to redefine the meaning of functions and operators. All primitives are overloaded so that they additionally perform a tracing operation: The primitive is logged onto a ‘tape’, along with its inputs to ensure that those intermediate variables are kept alive. At the end of the function’s execution, this tape contains a linear trace of all the numerical operations in the program. Derivatives can be calculated by walking this tape in reverse. [...]
OO is the technique used by PyTorch, Autograd, and Chainer [37]....
Tape-based Frameworks such as ADIFOR [8] and Tapenade [20] for Fortran and C use a global stack also called a ‘tape’2 to ensure that intermediate variables are kept alive. The original (primal) function is augmented so that it writes intermediate variables to the tape during the forward pass, and the adjoint program will read intermediate variables from the tape during the backward pass. More recently, tape-based ST was implemented for Python in the ML framework Tangent [38].
...
2 The tape used in ST stores only the intermediate variables, whereas the tape in OO is a program trace that stores the executed primitives as well.
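To make the distinction concrete, below is a toy sketch (plain Python, not PyTorch's actual internals) of operator-overloading AD: each overloaded primitive logs itself onto a tape during the forward pass, and gradients are obtained by walking that tape in reverse:

# toy operator-overloading tape: primitives log themselves during the forward pass
class Var:
    def __init__(self, value, tape=None):
        self.value = value
        self.grad = 0.0
        self.tape = tape if tape is not None else []

    def accumulate(self, g):
        self.grad += g

    def __mul__(self, other):
        out = Var(self.value * other.value, self.tape)
        # record the primitive and how to propagate its adjoint
        self.tape.append(lambda: (self.accumulate(other.value * out.grad),
                                  other.accumulate(self.value * out.grad)))
        return out

    def __add__(self, other):
        out = Var(self.value + other.value, self.tape)
        self.tape.append(lambda: (self.accumulate(out.grad),
                                  other.accumulate(out.grad)))
        return out

def backward(output):
    output.grad = 1.0
    for op in reversed(output.tape):  # walk the tape in reverse
        op()

tape = []
a, b = Var(2.0, tape), Var(3.0, tape)
d = a * b + a          # forward pass logs MUL, then ADD onto the tape
backward(d)
print(a.grad, b.grad)  # 4.0 and 2.0, matching d(d)/da = b + 1, d(d)/db = a

PyTorch's real implementation records a graph of backward nodes rather than a flat list, which is one reason its developers prefer to say they build graphs rather than tapes, as quoted in the question.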