
What is tape-based autograd in PyTorch?

I understand that autograd means automatic differentiation. But what exactly is tape-based autograd in PyTorch, and why are there so many discussions that either affirm or deny it?

For example:

this

In pytorch, there is no traditional sense of tape

and this

We don’t really build gradient tapes per se. But graphs.

but not this

Autograd is now a core torch package for automatic differentiation. It uses a tape based system for automatic differentiation.

And for further reference, please compare it with GradientTape in TensorFlow.

asked Nov 16 '20 by Long

2 Answers

There are different types of automatic differentiation, e.g. forward-mode, reverse-mode, and hybrids. The tape-based autograd in PyTorch simply refers to its use of reverse-mode automatic differentiation. Reverse-mode autodiff is a technique for computing gradients efficiently, and it happens to be the one used by backpropagation.
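To make reverse-mode concrete, here is a hand-worked sketch in plain Python (not tied to either framework) for the textbook function f(x1, x2) = x1*x2 + sin(x1): the forward pass computes and keeps the intermediate values, and the backward pass applies the chain rule in reverse, producing both partial derivatives in a single sweep.

import math

def f_and_grad(x1, x2):
    # forward pass: compute and remember the intermediates
    v1 = x1 * x2
    v2 = math.sin(x1)
    y = v1 + v2

    # reverse pass: seed dy/dy = 1 and apply the chain rule backwards
    dy = 1.0
    dv1 = dy                               # y = v1 + v2
    dv2 = dy
    dx1 = dv1 * x2 + dv2 * math.cos(x1)    # v1 = x1 * x2, v2 = sin(x1)
    dx2 = dv1 * x1
    return y, (dx1, dx2)

print(f_and_grad(2.0, 3.0))
# (6.909..., (2.583..., 2.0))  i.e. df/dx1 = x2 + cos(x1), df/dx2 = x1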


Now, in PyTorch, Autograd is the core torch package for automatic differentiation. It uses a tape-based system: in the forward phase, the autograd tape remembers all the operations it executed, and in the backward phase, it replays those operations.
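As a minimal sketch (assuming a recent PyTorch version), the forward expression below is recorded as it runs, and calling .backward() replays the recorded operations in reverse to populate the .grad attributes:

import torch

x = torch.tensor([2.0, 3.0], requires_grad=True)
y = (x ** 2).sum()    # forward: operations are recorded on the tape/graph
y.backward()          # backward: recorded operations are replayed in reverse
print(x.grad)         # tensor([4., 6.]), i.e. dy/dx = 2x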

It's the same in TensorFlow: to differentiate automatically, it also needs to remember what operations happened in what order during the forward pass. Then, during the backward pass, TensorFlow traverses this list of operations in reverse order to compute gradients. For this, TensorFlow provides the tf.GradientTape API for automatic differentiation, i.e. computing the gradient of a computation with respect to some inputs, usually tf.Variables. TensorFlow records the relevant operations executed inside the context of a tf.GradientTape onto a tape, and then uses that tape to compute the gradients of the recorded computation using reverse-mode differentiation.
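The same tiny gradient with tf.GradientTape (a sketch assuming TensorFlow 2.x): operations executed inside the tape context are recorded, and tape.gradient() walks the recording in reverse.

import tensorflow as tf

x = tf.Variable([2.0, 3.0])
with tf.GradientTape() as tape:
    y = tf.reduce_sum(x ** 2)   # recorded onto the tape
grad = tape.gradient(y, x)      # reverse-mode pass over the recorded ops
print(grad)                     # tf.Tensor([4. 6.], shape=(2,), dtype=float32)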

So, from a high-level viewpoint, both are doing the same thing. However, in a custom training loop, the forward pass and the loss calculation are more explicit in TensorFlow, since they happen inside the tf.GradientTape scope, whereas in PyTorch these operations are implicit. On the other hand, PyTorch requires gradient tracking to be disabled temporarily while the trainable parameters (weights and biases) are updated, which is done explicitly with torch.no_grad. In other words, TensorFlow's tape.gradient(...) call plays the role of PyTorch's loss.backward(). Below is a simplistic form of the above statements in code.

# TensorFlow 
[w, b] = tf_model.trainable_variables
for epoch in range(epochs):
  with tf.GradientTape() as tape:
    # forward pass and loss calculation
    # inside the explicit tape scope
    predictions = tf_model(x)
    loss = squared_error(predictions, y)

  # compute gradients (grad)
  w_grad, b_grad = tape.gradient(loss, tf_model.trainable_variables)

  # update training variables 
  w.assign(w - w_grad * learning_rate)
  b.assign(b - b_grad * learning_rate)


# PyTorch 
[w, b] = torch_model.parameters()
for epoch in range(epochs):
  # forward pass and loss calculation 
  # implicit tape-based AD 
  y_pred = torch_model(inputs)
  loss = squared_error(y_pred, labels)

  # compute gradients (grad)
  loss.backward()
  
  # update training variables / parameters  
  with torch.no_grad():
    w -= w.grad * learning_rate
    b -= b.grad * learning_rate
    w.grad.zero_()
    b.grad.zero_()

FYI: in the above, the trainable variables (w, b) are updated manually in both frameworks, but in practice we generally use an optimizer (e.g. Adam) to do the job.

# TensorFlow 
# ....
# update training variables 
optimizer.apply_gradients(zip([w_grad, b_grad], model.trainable_weights))

# PyTorch
# ....
# update training variables / parameters
optimizer.step()
optimizer.zero_grad()
answered Nov 20 '22 by M.Innat

I suspect this comes from two different uses of the word 'tape' in the context of automatic differentiation.

When people say that PyTorch is not tape-based, they mean it uses Operator Overloading (OO), as opposed to [tape-based] Source Transformation (ST), for automatic differentiation (a toy sketch of the OO-style tape follows the excerpts below).

[Operator overloading] relies on a language’s ability to redefine the meaning of functions and operators. All primitives are overloaded so that they additionally perform a tracing operation: The primitive is logged onto a ‘tape’, along with its inputs to ensure that those intermediate variables are kept alive. At the end of the function’s execution, this tape contains a linear trace of all the numerical operations in the program. Derivatives can be calculated by walking this tape in reverse. [...]
OO is the technique used by PyTorch, Autograd, and Chainer [37].

...

Tape-based Frameworks such as ADIFOR [8] and Tapenade [20] for Fortran and C use a global stack also called a ‘tape’² to ensure that intermediate variables are kept alive. The original (primal) function is augmented so that it writes intermediate variables to the tape during the forward pass, and the adjoint program will read intermediate variables from the tape during the backward pass. More recently, tape-based ST was implemented for Python in the ML framework Tangent [38].

...

² The tape used in ST stores only the intermediate variables, whereas the tape in OO is a program trace that stores the executed primitives as well.

  • Automatic differentiation in ML: Where we are and where we should be going
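To make the operator-overloading 'tape' described above concrete, here is a minimal sketch in plain Python (a toy illustration of the technique, not PyTorch's actual implementation): each overloaded primitive logs itself onto a tape during the forward pass, and backward() walks that trace in reverse to accumulate gradients.

class Var:
    def __init__(self, value, tape=None):
        self.value = value
        self.grad = 0.0
        self.tape = tape if tape is not None else []

    def __add__(self, other):
        out = Var(self.value + other.value, self.tape)
        # log the primitive together with how to push gradients back through it
        self.tape.append(lambda: (self.accum(out.grad), other.accum(out.grad)))
        return out

    def __mul__(self, other):
        out = Var(self.value * other.value, self.tape)
        self.tape.append(lambda: (self.accum(out.grad * other.value),
                                  other.accum(out.grad * self.value)))
        return out

    def accum(self, g):
        self.grad += g

    def backward(self):
        self.grad = 1.0
        for step in reversed(self.tape):   # walk the recorded trace in reverse
            step()

tape = []
x, y = Var(2.0, tape), Var(3.0, tape)
z = x * y + x           # forward pass: primitives are logged on the tape
z.backward()
print(x.grad, y.grad)   # 4.0 2.0  (dz/dx = y + 1, dz/dy = x)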
answered Nov 20 '22 by iacob