Pytorch, what are the gradient arguments

Explanation

For neural networks, we usually use a loss to assess how well the network has learned to classify the input image (or to perform some other task). The loss is usually a scalar value. In order to update the parameters of the network, we need to calculate the gradient of the loss w.r.t. the parameters, which are leaf nodes in the computation graph (these parameters are mostly the weights and biases of layers such as Convolution, Linear and so on).

According to the chain rule, to calculate the gradient of the loss w.r.t. a leaf node, we compute the derivative of the loss w.r.t. each element of some intermediate variable and the gradient of that element w.r.t. the leaf variable, multiply the two, and sum the products over all elements.

The gradient argument of a Variable's backward() method is used to calculate a weighted sum of the gradients of each element of that Variable w.r.t. the leaf Variable. The weights are just the derivatives of the final loss w.r.t. each element of the intermediate variable.
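As a minimal sketch of this weighted sum (the matrix W and the weight vector v below are made-up values, not part of the original question), take y = W @ x, whose Jacobian dy/dx is simply W. Passing v to backward() should then leave v @ W in x.grad:

```python
import torch

# Hypothetical 2x3 matrix; for y = W @ x the Jacobian dy/dx is W itself.
W = torch.tensor([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]])
x = torch.tensor([1.0, 1.0, 1.0], requires_grad=True)
y = W @ x  # intermediate, non-scalar output with 2 elements

# v plays the role of d(loss)/dy: one weight per element of y.
v = torch.tensor([0.5, -1.0])
y.backward(v)

# backward(v) stores sum_i v[i] * dy[i]/dx in x.grad, which is v @ W.
expected = v @ W
print(x.grad)    # tensor([-3.5000, -4.0000, -4.5000])
print(expected)  # tensor([-3.5000, -4.0000, -4.5000])
```

The gradient argument must have the same shape as the tensor that backward() is called on, since it holds one weight per element.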

A concrete example

Let's take a concrete and simple example to understand this.

from torch.autograd import Variable
import torch
x = Variable(torch.FloatTensor([[1, 2, 3, 4]]), requires_grad=True)
z = 2*x
loss = z.sum(dim=1)

# do backward for first element of z
z.backward(torch.FloatTensor([[1, 0, 0, 0]]), retain_graph=True)
print(x.grad.data)
x.grad.data.zero_() #remove gradient in x.grad, or it will be accumulated

# do backward for second element of z
z.backward(torch.FloatTensor([[0, 1, 0, 0]]), retain_graph=True)
print(x.grad.data)
x.grad.data.zero_()

# do backward for all elements of z, with weight equal to the derivative of
# loss w.r.t z_1, z_2, z_3 and z_4
z.backward(torch.FloatTensor([[1, 1, 1, 1]]), retain_graph=True)
print(x.grad.data)
x.grad.data.zero_()

# or we can directly backprop using loss
loss.backward() # equivalent to loss.backward(torch.FloatTensor([1.0]))
print(x.grad.data)

In the above example, the output of the first print is

2 0 0 0
[torch.FloatTensor of size 1x4]

which is exactly the derivative of z_1 w.r.t. x.

The output of the second print is:

0 2 0 0
[torch.FloatTensor of size 1x4]

which is the derivative of z_2 w.r.t. x.

Now if we use a weight of [1, 1, 1, 1] to calculate the derivative of z w.r.t. x, the outcome is 1*dz_1/dx + 1*dz_2/dx + 1*dz_3/dx + 1*dz_4/dx. So, not surprisingly, the output of the 3rd print is:

2 2 2 2
[torch.FloatTensor of size 1x4]

It should be noted that the weight vector [1, 1, 1, 1] is exactly the derivative of loss w.r.t. z_1, z_2, z_3 and z_4. The derivative of loss w.r.t. x is calculated as:

d(loss)/dx = d(loss)/dz_1 * dz_1/dx + d(loss)/dz_2 * dz_2/dx + d(loss)/dz_3 * dz_3/dx + d(loss)/dz_4 * dz_4/dx

So the output of the 4th print is the same as that of the 3rd print:

2 2 2 2
[torch.FloatTensor of size 1x4]
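The example above uses the older Variable API. A rough equivalent for current PyTorch might look like the sketch below. It assumes plain tensors created with torch.tensor and uses torch.autograd.grad, which returns the gradient instead of accumulating it into x.grad. The loss here is z.sum(), a true scalar, rather than the z.sum(dim=1) used above.

```python
import torch

x = torch.tensor([[1.0, 2.0, 3.0, 4.0]], requires_grad=True)
z = 2 * x
loss = z.sum()  # scalar loss

# Weight only the first element of z.
(g_first,) = torch.autograd.grad(
    z, x, grad_outputs=torch.tensor([[1.0, 0.0, 0.0, 0.0]]), retain_graph=True
)

# Weight every element of z by 1, which is d(loss)/dz.
(g_all,) = torch.autograd.grad(
    z, x, grad_outputs=torch.ones_like(z), retain_graph=True
)

# Backprop directly from the scalar loss.
(g_loss,) = torch.autograd.grad(loss, x)

print(g_first)  # tensor([[2., 0., 0., 0.]])
print(g_all)    # tensor([[2., 2., 2., 2.]])
print(g_loss)   # tensor([[2., 2., 2., 2.]])
```

Because torch.autograd.grad does not touch x.grad, no zeroing is needed between the calls.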

Typically, your computational graph has one scalar output, say loss. Then you can compute the gradient of loss w.r.t. the weights (w) with loss.backward(), where the default argument of backward() is 1.0.

If your output has multiple values (e.g. loss=[loss1, loss2, loss3]), you can compute the gradients of loss w.r.t. the weights by loss.backward(torch.FloatTensor([1.0, 1.0, 1.0])).

Furthermore, if you want to add weights or importances to different losses, you can use loss.backward(torch.FloatTensor([-0.1, 1.0, 0.0001])).

This computes the weighted sum -0.1*d(loss1)/dw + 1.0*d(loss2)/dw + 0.0001*d(loss3)/dw in a single backward pass and accumulates it into w.grad.
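One way to check this is to compare it with backpropagating an explicitly weighted scalar. In the sketch below, the three losses are hypothetical functions of w, chosen only so that the gradients are easy to verify by hand:

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

def three_losses(w):
    # Hypothetical losses: gradients are [1, 1], 2*w and 3*w**2 respectively.
    return torch.stack([w.sum(), (w ** 2).sum(), (w ** 3).sum()])

weights = torch.tensor([-0.1, 1.0, 0.0001])

# Option A: pass the weights as the gradient argument of backward().
three_losses(w).backward(weights)
grad_a = w.grad.clone()

# Option B: build the weighted sum explicitly and backprop the scalar.
w.grad = None  # reset the accumulated gradient first
(weights * three_losses(w)).sum().backward()
grad_b = w.grad.clone()

print(grad_a)  # tensor([1.9003, 3.9012])
print(grad_b)  # same values
```

Both options give -0.1*[1, 1] + 1.0*[2, 4] + 0.0001*[3, 12] = [1.9003, 3.9012].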

Here, the output of forward(), i.e. y, is a 3-vector.

The three values are the gradients at the output of the network. They are usually set to 1.0 if y is the final output, but can have other values as well, especially if y is part of a bigger network.

For example, if x is the input and y = [y1, y2, y3] is an intermediate output which is used to compute the final output z, then

dz/dx = dz/dy1 * dy1/dx + dz/dy2 * dy2/dx + dz/dy3 * dy3/dx

So here, the three values to backward are

[dz/dy1, dz/dy2, dz/dy3]

and then backward() computes dz/dx
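The two-step view above can be reproduced in a small sketch. The functions used for y and z are assumptions made for illustration: y = [x, x^2, x^3] and z = 1*y1 + 2*y2 + 3*y3.

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = torch.stack([x, x ** 2, x ** 3])           # intermediate output [y1, y2, y3]
z = (y * torch.tensor([1.0, 2.0, 3.0])).sum()  # hypothetical final scalar output

# Step 1: dz/dy, the three values that are passed to y.backward().
(dz_dy,) = torch.autograd.grad(z, y, retain_graph=True)

# Step 2: feed them to backward() on the intermediate output.
y.backward(dz_dy, retain_graph=True)
via_y = x.grad.clone()

# Reference: backprop directly from z.
x.grad = None
z.backward()
via_z = x.grad.clone()

print(dz_dy)         # tensor([1., 2., 3.])
print(via_y, via_z)  # tensor(45.) tensor(45.)
```

At x = 2, dy/dx = [1, 4, 12], so dz/dx = 1*1 + 2*4 + 3*12 = 45 either way.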

I can no longer find the original code on the PyTorch website:

gradients = torch.FloatTensor([0.1, 1.0, 0.0001])
y.backward(gradients)
print(x.grad)

The problem with the code above is that it does not show the function that produced y, so we don't know how many tensors the function takes as input or what their dimensions are.

To fully understand this I created an example close to the original:

Example 1:

import torch

a = torch.tensor([1.0, 2.0, 3.0], requires_grad = True)
b = torch.tensor([3.0, 4.0, 5.0], requires_grad = True)
c = torch.tensor([6.0, 7.0, 8.0], requires_grad = True)

y=3*a + 2*b*b + torch.log(c)    
gradients = torch.FloatTensor([0.1, 1.0, 0.0001])
y.backward(gradients,retain_graph=True)    

print(a.grad) # tensor([3.0000e-01, 3.0000e+00, 3.0000e-04])
print(b.grad) # tensor([1.2000e+00, 1.6000e+01, 2.0000e-03])
print(c.grad) # tensor([1.6667e-02, 1.4286e-01, 1.2500e-05])

I assumed our function is y=3*a + 2*b*b + torch.log(c) and the parameters are tensors with three elements inside.

You can think of gradients = torch.FloatTensor([0.1, 1.0, 0.0001]) as the per-element weights applied to y: it is the vector that multiplies the Jacobian in the product that backward() computes.

As you may have heard, the computation done by PyTorch's autograd system is equivalent to a vector-Jacobian product.

Jacobian

In case you have a function, like we did:

y=3*a + 2*b*b + torch.log(c)

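The Jacobians of y w.r.t. a, b and c would then have the entries [3, 4*b, 1/c] on their diagonals, because every element of y depends only on the matching elements of a, b and c. However, PyTorch does not build this symbolic Jacobian to calculate the gradients at a certain point.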
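To make the link with Example 1 concrete, the sketch below builds the full Jacobians with torch.autograd.functional.jacobian and multiplies them by the same gradients vector. Wrapping the formula in a function f is an assumption made for illustration; backward() itself never materializes these matrices.

```python
import torch
from torch.autograd.functional import jacobian

a = torch.tensor([1.0, 2.0, 3.0])
b = torch.tensor([3.0, 4.0, 5.0])
c = torch.tensor([6.0, 7.0, 8.0])

def f(a, b, c):
    # Same formula as in Example 1, wrapped in a function for jacobian().
    return 3 * a + 2 * b * b + torch.log(c)

# One 3x3 Jacobian per input; each is diagonal for this element-wise function.
J_a, J_b, J_c = jacobian(f, (a, b, c))

v = torch.tensor([0.1, 1.0, 0.0001])  # the gradients vector from Example 1

print(torch.diagonal(J_b))  # 4*b -> tensor([12., 16., 20.])
print(v @ J_a)              # matches a.grad in Example 1
print(v @ J_b)              # matches b.grad in Example 1
print(v @ J_c)              # matches c.grad in Example 1
```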

PyTorch uses a forward pass and reverse-mode automatic differentiation (AD) in tandem.

There is no symbolic math involved and no numerical differentiation.

Numerical differentiation would mean evaluating y at b=1 and at b=1+ε, where ε is small, and approximating δy/δb by the difference of the two results divided by ε.
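For comparison, here is a small sketch that sets a finite-difference estimate next to the autograd result. It only looks at the 2*b*b term, since the terms in a and c cancel in the difference, and it assumes float64 and ε = 1e-6 to keep rounding error small:

```python
import torch

def f(b):
    # Only the part of y that depends on b.
    return 2 * b * b

b = torch.tensor(1.0, dtype=torch.float64, requires_grad=True)

# Autograd: exact derivative 4*b at b = 1.
f(b).backward()

# Numerical differentiation: forward difference with a small epsilon.
eps = 1e-6
with torch.no_grad():
    numeric = (f(b + eps) - f(b)) / eps

print(b.grad)   # exactly 4
print(numeric)  # close to 4, with an error on the order of eps
```

The numerical estimate carries an error that depends on ε, while autograd applies the exact derivative rule of each operation.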

If you don't pass gradients to y.backward():

Example 2:

a = torch.tensor(0.1, requires_grad = True)
b = torch.tensor(1.0, requires_grad = True)
c = torch.tensor(0.1, requires_grad = True)
y=3*a + 2*b*b + torch.log(c)

y.backward()

print(a.grad) # tensor(3.)
print(b.grad) # tensor(4.)
print(c.grad) # tensor(10.)

You will simply get the result at a point, based on how you set your a, b, c tensors initially.

Be careful how you initialize your a, b, c:

Example 3:

a = torch.empty(1, requires_grad = True, pin_memory=True)
b = torch.empty(1, requires_grad = True, pin_memory=True)
c = torch.empty(1, requires_grad = True, pin_memory=True)

y=3*a + 2*b*b + torch.log(c)

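# Note: y has shape [1] here. The outputs below come from an older PyTorch
# release; recent releases require gradients to have the same shape as y.
gradients = torch.FloatTensor([0.1, 1.0, 0.0001])
y.backward(gradients)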

print(a.grad) # tensor([3.3003])
print(b.grad) # tensor([0.])
print(c.grad) # tensor([inf])

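torch.empty() returns uninitialized memory, so the values of a, b and c (and therefore the gradients that depend on them) are arbitrary and you may get different results each time. pin_memory=True does not guarantee any particular contents, so initialize the tensors explicitly, for example with torch.tensor() or torch.zeros(), when you need reproducible results.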

Also, note that .grad acts as an accumulator across backward() calls, so zero it when needed.

Example 4:

a = torch.tensor(1.0, requires_grad = True)
b = torch.tensor(1.0, requires_grad = True)
c = torch.tensor(1.0, requires_grad = True)
y=3*a + 2*b*b + torch.log(c)

y.backward(retain_graph=True)
y.backward()

print(a.grad) # tensor(6.)
print(b.grad) # tensor(8.)
print(c.grad) # tensor(2.)
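A minimal sketch of the reset, using a single tensor and the 3*a term only (zeroing in place with a.grad.zero_() is one option; setting a.grad = None is another):

```python
import torch

a = torch.tensor(1.0, requires_grad=True)

(3 * a).backward()
first = a.grad.clone()        # tensor(3.)

(3 * a).backward()
accumulated = a.grad.clone()  # tensor(6.): the second call adds to the first

a.grad.zero_()                # reset the accumulator in place
(3 * a).backward()
after_reset = a.grad.clone()  # tensor(3.) again

print(first, accumulated, after_reset)
```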

Lastly, a few tips on the terms PyTorch uses:

PyTorch builds a dynamic computational graph during the forward pass and uses it to calculate the gradients. The graph looks much like a tree.

So you will often hear that the leaves of this tree are the input tensors and the root is the output tensor.

Gradients are calculated by tracing the graph from the root to the leaves and multiplying every gradient along the way using the chain rule. This multiplication occurs in the backward pass.
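These terms map onto tensor attributes that you can inspect. The sketch below uses made-up values for a and b and only the first two terms of the function from the examples:

```python
import torch

a = torch.tensor(1.0, requires_grad=True)
b = torch.tensor(2.0, requires_grad=True)
y = 3 * a + 2 * b * b  # the forward pass records the graph

# Leaves are tensors created directly by the user; no operation produced them.
print(a.is_leaf, a.grad_fn)  # True None

# The root is the result of an operation and points at its backward node.
print(y.is_leaf)             # False
print(y.grad_fn)             # backward node of the final addition

# The backward pass walks from the root to the leaves and fills in .grad.
y.backward()
print(a.grad, b.grad)        # tensor(3.) tensor(8.)
```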

Some time ago I created a PyTorch Automatic Differentiation tutorial that you may find interesting; it explains all the tiny details of AD.
