
Using PyTorch's autograd efficiently with tensors by calculating the Jacobian

Tags: python, pytorch

In my previous question I found how to use PyTorch's autograd with tensors:

import torch
from torch.autograd import grad
import torch.nn as nn
import torch.optim as optim

class net_x(nn.Module): 
        def __init__(self):
            super(net_x, self).__init__()
            self.fc1=nn.Linear(1, 20) 
            self.fc2=nn.Linear(20, 20)
            self.out=nn.Linear(20, 4) #a,b,c,d

        def forward(self, x):
            x=torch.tanh(self.fc1(x))
            x=torch.tanh(self.fc2(x))
            x=self.out(x)
            return x

nx = net_x()

#input
t = torch.tensor([1.0, 2.0, 3.2], requires_grad = True) #input vector
t = torch.reshape(t, (3,1)) #reshape for batch

#method 
dx = torch.autograd.functional.jacobian(lambda t_: nx(t_), t)
dx = torch.diagonal(torch.diagonal(dx, 0, -1), 0)[0] #first vector
#dx = torch.diagonal(torch.diagonal(dx, 1, -1), 0)[0] #2nd vector
#dx = torch.diagonal(torch.diagonal(dx, 2, -1), 0)[0] #3rd vector
#dx = torch.diagonal(torch.diagonal(dx, 3, -1), 0)[0] #4th vector
dx 
>>> 
tensor([-0.0142, -0.0517, -0.0634])

The issue is that grad only knows how to propagate gradients from a scalar tensor (which my network's output is not), which is why I had to calculate the Jacobian.
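For example, reusing nx and t from above, grad fails on the batched output unless it is reduced to a scalar (a small sketch of the limitation):

out = nx(t)                 # shape (3, 4): not a scalar
try:
    grad(out, t)            # no grad_outputs given
except RuntimeError as e:
    print(e)                # "grad can be implicitly created only for scalar outputs"
grad(out.sum(), t)          # fine: a scalar output can be propagated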

However, this is not very efficient and a bit slow as my matrix is large and calculating the entire Jacobian takes a while (and I'm also not using the entire Jacobian matrix).

Is there a way to calculate only the diagonals of the Jacobian (to get the 4 vectors in this example)?

There appears to be an open feature request but it doesn't appear to have gotten much attention.

Update 1:
I tried what @iacob suggested and passed vectorize=True to torch.autograd.functional.jacobian.
However, this seems to be slower. To test this I changed my network output from 4 to 400, and my input t to be:

val = 100
t = torch.rand(val, requires_grad = True) #input vector
t = torch.reshape(t, (val,1)) #reshape for batch

Without vectorize=True:

Wall time: 10.4 s

With vectorize=True:

Wall time: 14.6 s
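(Roughly, the comparison was between calls like these; the vectorize=True form is spelled out here for clarity:)

%time dx = torch.autograd.functional.jacobian(lambda t_: nx(t_), t)                  # ~10.4 s
%time dx = torch.autograd.functional.jacobian(lambda t_: nx(t_), t, vectorize=True)  # ~14.6 s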
Asked May 10 '21 by Penguin

2 Answers

OK, results first:

The performance (my laptop has an RTX 2070, and PyTorch is using it):

# Method 1: Use the jacobian function
CPU times: user 34.6 s, sys: 165 ms, total: 34.7 s
Wall time: 5.8 s

# Method 2: Sample with appropriate vectors
CPU times: user 1.11 ms, sys: 0 ns, total: 1.11 ms
Wall time: 191 µs

It's about 30000x faster.


Why should you use backward instead of jacobian (in your case)

I'm not a PyTorch expert, but in my experience it's pretty inefficient to compute the full Jacobian matrix if you do not need all of its elements.

If you only need the diagonal elements, you can use the backward function to compute a vector-Jacobian product with appropriately chosen vectors. If you set those vectors correctly, you can sample/extract specific elements of the Jacobian matrix.
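A minimal self-contained sketch of that idea (using a stand-in nn.Linear instead of the question's network; the names here are illustrative only):

import torch
import torch.nn as nn

net = nn.Linear(1, 4)                      # stand-in for net_x
x = torch.rand(3, 1, requires_grad=True)   # batch of 3 scalar inputs
out = net(x)                               # shape (3, 4)

v = torch.zeros_like(out)
v[:, 0] = 1.0                              # sampling vector: pick the first output column
(vjp,) = torch.autograd.grad(out, x, grad_outputs=v)
print(vjp.shape)                           # torch.Size([3, 1]); one backward pass, no full Jacobian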

A little linear algebra:

import numpy as np

j = np.array([[1, 2], [3, 4]])  # the 2x2 Jacobian you want
sv = np.array([[1], [0]])       # 2x1 sampling vector

first_diagonal_element = sv.T.dot(j).dot(sv)  # equals j[0, 0]

This is not that impressive for such a simple case. But if PyTorch had to materialize every Jacobian along the way (j could be the result of a long chain of matrix-matrix multiplications), it would be far too slow. In contrast, computing a sequence of vector-Jacobian products keeps every intermediate result a vector and is much faster.
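To make the cost argument concrete, here is a small hypothetical numpy sketch comparing the two orderings:

import numpy as np

n = 500
j1, j2, j3 = np.random.rand(n, n), np.random.rand(n, n), np.random.rand(n, n)  # "local Jacobians"
sv = np.zeros((n, 1)); sv[0] = 1.0                                             # sampling vector

slow = sv.T.dot(j1.dot(j2).dot(j3))   # build the full chain first: two O(n^3) matrix-matrix products
fast = sv.T.dot(j1).dot(j2).dot(j3)   # push the vector through: only O(n^2) vector-matrix products
assert np.allclose(slow, fast)        # same result, very different cost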


Solution

Sample elements from jacobian:

import torch
from torch.autograd import grad
import torch.nn as nn
import torch.optim as optim

class net_x(nn.Module): 
        def __init__(self):
            super(net_x, self).__init__()
            self.fc1=nn.Linear(1, 20) 
            self.fc2=nn.Linear(20, 20)
            self.out=nn.Linear(20, 400) #400 outputs now (was 4: a,b,c,d)

        def forward(self, x):
            x=torch.tanh(self.fc1(x))
            x=torch.tanh(self.fc2(x))
            x=self.out(x)
            return x

nx = net_x()

#input

val = 100
a = torch.rand(val, requires_grad = True) #input vector
t = torch.reshape(a, (val,1)) #reshape for batch


#method 
%time dx = torch.autograd.functional.jacobian(lambda t_: nx(t_), t)
dx = torch.diagonal(torch.diagonal(dx, 0, -1), 0)[0] #first vector
#dx = torch.diagonal(torch.diagonal(dx, 1, -1), 0)[0] #2nd vector
#dx = torch.diagonal(torch.diagonal(dx, 2, -1), 0)[0] #3rd vector
#dx = torch.diagonal(torch.diagonal(dx, 3, -1), 0)[0] #4th vector
print(dx)


out = nx(t)                    # forward pass, shape (val, 400)
m = torch.zeros((val, 400))    # "upstream gradient" / sampling matrix
m[:, 0] = 1                    # select the first output column
%time out.backward(m)          # one vector-Jacobian product
print(a.grad)

a.grad should be equal to the first vector dx above, and m is the sampling matrix that corresponds to what your code calls the "first vector".


  1. But if I run it again, the answer will change.

Yes, you've already figured it out: gradients accumulate every time you call backward, so you have to reset a.grad to zero first if you run that cell multiple times.
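For example, a sketch of sampling the second column on a later run (reusing nx, t, a and val from the code above):

if a.grad is not None:
    a.grad.zero_()             # clear what the previous backward accumulated

out = nx(t)                    # fresh forward pass (backward freed the old graph)
m = torch.zeros((val, 400))
m[:, 1] = 1                    # sample the second output column this time
out.backward(m)
print(a.grad)                  # the "2nd vector" of diagonal Jacobian entries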

  2. Can you explain the idea behind the m method, both using torch.zeros and setting the column to 1? Also, how come the grad is on a rather than on t?
  • The idea behind the m method: what backward actually computes is a vector-Jacobian product, where the vector represents the so-called "upstream gradient" and the Jacobian is the "local gradient" (this Jacobian is also the one you get from the jacobian function, since your lambda can be viewed as a single "local" operation). If you need specific elements of the Jacobian, you can construct an "upstream gradient" that extracts exactly those entries. However, such upstream gradients can be hard to find (for me at least) when complicated tensor operations are involved.
  • PyTorch accumulates gradients on leaf nodes of the computational graph. Your original line t = torch.reshape(t, (3,1)) loses the handle to the leaf node, so t then refers to an intermediate node instead of a leaf. To keep access to the leaf node, I created the tensor a; a quick check is sketched after this list.
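A quick check of which tensor is the leaf (small sketch):

a = torch.rand(val, requires_grad=True)   # created directly by the user -> leaf
t = torch.reshape(a, (val, 1))            # produced by an operation -> not a leaf
print(a.is_leaf, t.is_leaf)               # True False
# backward accumulates into a.grad; t.grad stays None (PyTorch warns if you read it)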
Answered Nov 14 '22 by Shihao Xu


Have you tried passing vectorize=True to torch.autograd.functional.jacobian?

vectorize (bool, optional) – This feature is experimental, please use at your own risk. When computing the jacobian, usually we invoke autograd.grad once per row of the jacobian. If this flag is True, we use the vmap prototype feature as the backend to vectorize calls to autograd.grad so we only invoke it once instead of once per row. This should lead to performance improvements in many use cases, however, due to this feature being incomplete, there may be performance cliffs. Please use torch._C._debug_only_display_vmap_fallback_warnings(True) to show any performance warnings and file us issues if warnings exist for your use case. Defaults to False.

Answered Nov 14 '22 by iacob