In my previous question I found how to use PyTorch's autograd with tensors:
import torch
from torch.autograd import grad
import torch.nn as nn
import torch.optim as optim
class net_x(nn.Module):
    def __init__(self):
        super(net_x, self).__init__()
        self.fc1 = nn.Linear(1, 20)
        self.fc2 = nn.Linear(20, 20)
        self.out = nn.Linear(20, 4)  # a, b, c, d

    def forward(self, x):
        x = torch.tanh(self.fc1(x))
        x = torch.tanh(self.fc2(x))
        x = self.out(x)
        return x
nx = net_x()
#input
t = torch.tensor([1.0, 2.0, 3.2], requires_grad = True) #input vector
t = torch.reshape(t, (3,1)) #reshape for batch
#method
dx = torch.autograd.functional.jacobian(lambda t_: nx(t_), t)
dx = torch.diagonal(torch.diagonal(dx, 0, -1), 0)[0] #first vector
#dx = torch.diagonal(torch.diagonal(dx, 1, -1), 0)[0] #2nd vector
#dx = torch.diagonal(torch.diagonal(dx, 2, -1), 0)[0] #3rd vector
#dx = torch.diagonal(torch.diagonal(dx, 3, -1), 0)[0] #4th vector
dx
>>>
tensor([-0.0142, -0.0517, -0.0634])
The issue is that grad only knows how to propagate gradients from a scalar tensor (which my network's output is not), which is why I had to calculate the Jacobian.
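For example (an illustration of the failure, not code from above):
out = nx(t)   # shape (3, 4): not a scalar
# grad(out, t) raises: RuntimeError: grad can be implicitly created only for scalar outputs
# unless grad_outputs is passed, which weights and sums over the 4 outputs instead of separating them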
However, this is not very efficient: my matrix is large, calculating the entire Jacobian takes a while, and I'm not even using the entire Jacobian matrix.
Is there a way to calculate only the diagonals of the Jacobian (to get the 4 vectors in this example)?
There is an open feature request for this, but it doesn't appear to have gotten much attention.
Update 1:
I tried what @iacob said about setting torch.autograd.functional.jacobian(vectorize=True). However, this seems to be slower. To test this I changed my network output from 4 to 400, and my input t to be:
val = 100
t = torch.rand(val, requires_grad = True) #input vector
t = torch.reshape(t, (val,1)) #reshape for batch
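The two calls being compared look roughly like this (a sketch; the exact timing cell isn't shown here):
# default: autograd.grad is invoked once per row of the Jacobian
dx = torch.autograd.functional.jacobian(lambda t_: nx(t_), t)
# experimental: rows are vectorized with the vmap prototype
dx_vec = torch.autograd.functional.jacobian(lambda t_: nx(t_), t, vectorize=True)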
Without vectorize=True:
Wall time: 10.4 s
With vectorize=True:
Wall time: 14.6 s
OK, results first:
The performance (my laptop has an RTX 2070, and PyTorch is using it):
# Method 1: Use the jacobian function
CPU times: user 34.6 s, sys: 165 ms, total: 34.7 s
Wall time: 5.8 s
# Method 2: Sample with appropriate vectors
CPU times: user 1.11 ms, sys: 0 ns, total: 1.11 ms
Wall time: 191 µs
It's about 30000x faster.
backward instead of jacobian (in your case)
I'm not a pro with PyTorch, but in my experience it's pretty inefficient to calculate the Jacobi matrix if you do not need all of its elements.
If you only need the diagonal elements, you can use the backward function to calculate vector-Jacobian products with some specific vectors. If you set the vectors correctly, you can sample/extract specific elements from the Jacobi matrix.
A little linear algebra:
import numpy as np
j = np.array([[1, 2], [3, 4]])  # the 2x2 Jacobi matrix you want
sv = np.array([[1], [0]])       # 2x1 sampling vector
first_diagonal_element = sv.T.dot(j).dot(sv)  # equals j[0, 0]
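Note that the left multiplication alone, sv.T.dot(j), already picks out the first row of j; in effect, that is what the backward call below does, with the one-hot matrix m playing the role of sv.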
It's not that powerful for this simple case. But if PyTorch needs to calculate all the Jacobians along the way (j could be the result of a long sequence of matrix-matrix multiplications), it would be way too slow. In contrast, if we calculate a sequence of vector-Jacobian multiplications, the computation would be super fast.
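To make that cost difference concrete, here is a rough standalone NumPy illustration (not part of the original answer; the random matrices stand in for a chain of local Jacobians):
import numpy as np
n = 1000
v = np.random.rand(1, n)                                # the "upstream" sampling vector
J1, J2, J3 = (np.random.rand(n, n) for _ in range(3))   # stand-ins for local Jacobians
full = v @ (J1 @ J2 @ J3)        # builds full n x n products first: matrix-matrix work
vjp = ((v @ J1) @ J2) @ J3       # folds the vector in from the left: only vector-matrix work
assert np.allclose(full, vjp)    # same result, far less computation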
Sample elements from jacobian:
import torch
from torch.autograd import grad
import torch.nn as nn
import torch.optim as optim
class net_x(nn.Module):
    def __init__(self):
        super(net_x, self).__init__()
        self.fc1 = nn.Linear(1, 20)
        self.fc2 = nn.Linear(20, 20)
        self.out = nn.Linear(20, 400)  # 400 outputs for the timing test

    def forward(self, x):
        x = torch.tanh(self.fc1(x))
        x = torch.tanh(self.fc2(x))
        x = self.out(x)
        return x
nx = net_x()
#input
val = 100
a = torch.rand(val, requires_grad = True) #input vector
t = torch.reshape(a, (val,1)) #reshape for batch
#method
%time dx = torch.autograd.functional.jacobian(lambda t_: nx(t_), t)
dx = torch.diagonal(torch.diagonal(dx, 0, -1), 0)[0] #first vector
#dx = torch.diagonal(torch.diagonal(dx, 1, -1), 0)[0] #2nd vector
#dx = torch.diagonal(torch.diagonal(dx, 2, -1), 0)[0] #3rd vector
#dx = torch.diagonal(torch.diagonal(dx, 3, -1), 0)[0] #4th vector
print(dx)
out = nx(t)
m = torch.zeros((val, 400))   # sampling ("upstream gradient") matrix: one column per output
m[:, 0] = 1                   # one-hot column 0 selects the 1st output for every batch element
%time out.backward(m)         # vector-Jacobian product instead of the full Jacobian
print(a.grad)
a.grad should be equal to the first tensor dx. And m is the sampling vector that corresponds to what is called the "first vector" in your code.
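To get the other diagonal vectors, the same trick works with a different one-hot column (a sketch mirroring the commented-out "2nd vector" line in the question):
a.grad = None                 # clear the gradient accumulated by the backward call above
m2 = torch.zeros((val, 400))
m2[:, 1] = 1                  # select the 2nd output for every batch element
out = nx(t)
out.backward(m2)
print(a.grad)                 # should correspond to the question's "2nd vector"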
- but if I run it again the answer will change.
Yeah, you've already figured it out. The gradients accumulate every time you call backward, so you have to set a.grad to zero first if you have to run that cell multiple times.
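A related option, not mentioned above, is torch.autograd.grad, which returns the vector-Jacobian product as a fresh tensor instead of accumulating into a.grad, so re-running is safe (a sketch reusing nx, t, a, and m from the code above):
out = nx(t)
# grad_outputs plays the same role as m in out.backward(m)
(vjp,) = torch.autograd.grad(outputs=out, inputs=a, grad_outputs=m)
print(vjp)   # same values as a.grad from the backward() version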
- can you explain the idea behind the m method? Both using the torch.zeros and setting the column to 1. Also, how come the grad is on a rather than on t?
The idea behind the m method is this: what the backward function calculates is actually a vector-Jacobian multiplication, where the vector represents the so-called "upstream gradient" and the Jacobi matrix is the "local gradient" (this Jacobian is also the one you get with the jacobian function, since your lambda could be viewed as a single "local" operation). If you need some elements from the Jacobian, you can fake (or, more precisely, construct) some "upstream gradient" with which you can extract specific entries from the Jacobian. However, sometimes these upstream gradients might be hard to find (for me at least) if complicated tensor operations are involved.
As for why the grad is on a rather than t: t = torch.reshape(t, (3,1)) loses the handle to the leaf node, and t then refers to an intermediate node instead of a leaf node. In order to keep access to the leaf node, I created the tensor a.
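A quick way to see the leaf-node point (standalone toy snippet, not from the answer):
a = torch.rand(3, requires_grad=True)   # leaf tensor: backward() populates a.grad
t = torch.reshape(a, (3, 1))            # created by an op, so it is an intermediate node
print(a.is_leaf, t.is_leaf)             # True False; t.grad stays None after backward()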
Have you tried setting torch.autograd.functional.jacobian(vectorize=True)?
vectorize (bool, optional) – This feature is experimental, please use at your own risk. When computing the jacobian, usually we invoke autograd.grad once per row of the jacobian. If this flag is True, we use the vmap prototype feature as the backend to vectorize calls to autograd.grad so we only invoke it once instead of once per row. This should lead to performance improvements in many use cases, however, due to this feature being incomplete, there may be performance cliffs. Please use torch._C._debug_only_display_vmap_fallback_warnings(True) to show any performance warnings and file us issues if warnings exist for your use case. Defaults to False.