I have three simple questions.
1. What will happen if my custom loss function is not differentiable? Will PyTorch raise an error, or will it do something else?
2. If I declare a loss variable in my custom function that represents the final loss of the model, should I set requires_grad = True for that variable? Or doesn't it matter? If it doesn't matter, then why?
3. I have seen people sometimes write a separate layer and compute the loss in its forward function. Which approach is preferable, writing a function or a layer? Why?
I need a clear and nice explanation of these questions to resolve my confusion. Please help.
Let me have a go.
This depends on what you mean by "non-differentiable". The first definition that makes sense here is that PyTorch doesn't know how to compute gradients. If you try to compute gradients nevertheless, this will raise an error. The two possible scenarios are:
a) You're using a custom PyTorch operation for which gradients have not been implemented, e.g. torch.svd(). In that case you will get a TypeError:
import torch
from torch.autograd import Function
from torch.autograd import Variable
A = Variable(torch.randn(10,10), requires_grad=True)
u, s, v = torch.svd(A) # raises TypeError
b) You have implemented your own operation, but did not define backward(). In this case, you will get a NotImplementedError:
class my_function(Function):  # forgot to define backward()
    def forward(self, x):
        return 2 * x

A = Variable(torch.randn(10, 10), requires_grad=True)
B = my_function()(A)
C = torch.sum(B)
C.backward()  # will raise NotImplementedError
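For completeness, here is a minimal sketch (my addition, using the same old-style Function API as above) of what defining backward() could look like for this toy operation; grad_output is the gradient flowing back from C:
class my_function(Function):
    def forward(self, x):
        return 2 * x

    def backward(self, grad_output):
        # the derivative of 2*x with respect to x is 2,
        # so we simply scale the incoming gradient
        return 2 * grad_output

A = Variable(torch.randn(10, 10), requires_grad=True)
B = my_function()(A)
C = torch.sum(B)
C.backward()  # works now; A.grad is a 10x10 tensor filled with 2s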
The second definition that makes sense is "mathematically non-differentiable". Clearly, an operation which is mathematically not differentiable should either not have a backward() method implemented, or should return a sensible sub-gradient. Consider for example torch.abs(), whose backward() method returns the subgradient 0 at 0:
A = Variable(torch.Tensor([-1, 0, 1]), requires_grad=True)
B = torch.abs(A)
B.backward(torch.Tensor([1, 1, 1]))
A.grad.data  # [-1, 0, 1]: the subgradient of abs at 0 is 0
For such cases, you should refer directly to the PyTorch documentation and dig out the backward() method of the respective operation.
It doesn't matter. The purpose of requires_grad is to avoid unnecessary computation of gradients for subgraphs. If a single input to an operation requires gradients, its output will also require gradients. Conversely, only if all inputs don't require gradients will the output not require them either. Backward computation is never performed in subgraphs where no Variable requires gradients.
Since there are most likely some Variables that require gradients involved in computing your loss (for example the parameters of a subclass of nn.Module()), your loss Variable will also require gradients automatically. Note, however, that precisely because of how requires_grad works (see above), you can only change requires_grad for leaf variables of your graph anyway.
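To make this concrete, here is a small sketch (my addition) of how requires_grad propagates; the tensors are just placeholders:
import torch
from torch.autograd import Variable

x = Variable(torch.randn(5, 5))                      # requires_grad defaults to False
w = Variable(torch.randn(5, 5), requires_grad=True)  # e.g. a parameter
y = x.mm(w)             # one input requires gradients, so the output does too
print(y.requires_grad)  # True
z = x + Variable(torch.randn(5, 5))
print(z.requires_grad)  # False: no input requires gradients
y.requires_grad = False  # raises RuntimeError: requires_grad can only be changed on leaf Variables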
All the PyTorch loss functions are subclasses of _Loss, which is itself a subclass of nn.Module. See here. If you'd like to stick to this convention, you should subclass _Loss when defining your custom loss function. Apart from consistency, one advantage is that your subclass will raise an AssertionError if you haven't marked your target variables as volatile or requires_grad = False. Another advantage is that you can nest your loss function in nn.Sequential(), because it is an nn.Module. I would recommend this approach for these reasons.
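As an illustration (my addition, not part of the original answer), a custom loss written as a layer could look roughly like the sketch below. The name MyMSELoss and the shapes are hypothetical, and you could subclass torch.nn.modules.loss._Loss instead of nn.Module to get the target check mentioned above:
import torch
import torch.nn as nn
from torch.autograd import Variable

class MyMSELoss(nn.Module):  # hypothetical custom loss written as a layer
    def forward(self, input, target):
        # plain mean squared error, built only from differentiable ops
        return torch.mean((input - target) ** 2)

prediction = Variable(torch.randn(4, 3), requires_grad=True)
target = Variable(torch.randn(4, 3))        # targets should not require gradients
loss = MyMSELoss()(prediction, target)
loss.backward()                             # gradients flow back into prediction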