
Custom loss function in PyTorch

Tags:

pytorch

I have three simple questions.

  1. What will happen if my custom loss function is not differentiable? Will PyTorch throw an error or do something else?
  2. If I declare a loss variable in my custom function that represents the final loss of the model, should I set requires_grad = True for that variable, or does it not matter? If it doesn't matter, why not?
  3. I have seen people sometimes write a separate layer and compute the loss in the forward function. Which approach is preferable, writing a function or a layer? Why?

I would appreciate a clear explanation of these points to resolve my confusion. Please help.

Wasi Ahmad asked Jun 16 '17 20:06


1 Answer

Let me have a go.

  1. This depends on what you mean by "non-differentiable". The first definition that makes sense here is that PyTorch doesn't know how to compute gradients. If you try to compute gradients nevertheless, this will raise an error. The two possible scenarios are:

    a) You're using a built-in PyTorch operation for which gradients have not been implemented, e.g. torch.svd() in the PyTorch version this answer was written against (later releases added a backward for it). In that case you will get a TypeError:

    import torch
    from torch.autograd import Function
    from torch.autograd import Variable
    
    A = Variable(torch.randn(10,10), requires_grad=True)
    u, s, v = torch.svd(A) # raises TypeError
    

    b) You have implemented your own operation, but did not define backward(). In this case, you will get a NotImplementedError:

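    class MyFunction(Function):  # forgot to define backward()

        # Static-method style, as required by current PyTorch releases
        @staticmethod
        def forward(ctx, x):
            return 2 * x

    # The input must require gradients, otherwise backward() fails before reaching MyFunction
    A = Variable(torch.randn(10, 10), requires_grad=True)
    B = MyFunction.apply(A)
    C = torch.sum(B)
    C.backward()  # will raise NotImplementedError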
    

    The second definition that makes sense is "mathematically non-differentiable". An operation that is mathematically non-differentiable at some points should either have no backward() method at all or return a sensible subgradient at those points. Consider for example torch.abs(), whose backward() method returns the subgradient 0 at 0:

    A = Variable(torch.Tensor([-1, 0, 1]), requires_grad=True)
    B = torch.abs(A)
    B.backward(torch.Tensor([1, 1, 1]))
    A.grad.data  # -1, 0, 1: the gradient reported at 0 is 0
    

    For these cases, you should consult the PyTorch documentation or source and look up the backward() method of the respective operation.
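To complement case b) in point 1, here is a minimal sketch of the same doubling operation with backward() defined, so that autograd knows how to differentiate it. It assumes a recent PyTorch release (static-method Function API, tensors created with requires_grad directly); the class name Double is hypothetical and chosen only for illustration.

```python
import torch
from torch.autograd import Function


class Double(Function):
    """Hypothetical example op: y = 2 * x, with its gradient written by hand."""

    @staticmethod
    def forward(ctx, x):
        return 2 * x

    @staticmethod
    def backward(ctx, grad_output):
        # dy/dx = 2, so the incoming gradient is scaled by 2.
        return 2 * grad_output


x = torch.randn(3, requires_grad=True)
y = Double.apply(x).sum()
y.backward()  # no NotImplementedError this time
print(x.grad)  # every entry is 2
```

With backward() in place, the call to y.backward() succeeds and x.grad holds the hand-written derivative.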

  2. It doesn't matter. The purpose of requires_grad is to avoid unnecessary gradient computations for subgraphs. If a single input to an operation requires gradient, its output will also require gradient. Conversely, the output won't require gradient only if none of the inputs do. Backward computation is never performed in subgraphs where no Variable requires gradients.

    Since there are most likely some Variables that require gradients (for example the parameters of a subclass of nn.Module), your loss Variable will also require gradients automatically. Note, however, that precisely because of how requires_grad works (see above), you can only change requires_grad for leaf Variables of your graph anyway.
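The propagation rule from point 2 can be checked directly. The following is a small sketch assuming a recent PyTorch release where tensors carry requires_grad themselves; the variable names (w, x, flag_change_rejected) are made up for illustration.

```python
import torch

w = torch.randn(3, requires_grad=True)  # leaf that requires gradients, like a parameter
x = torch.randn(3)                      # plain input, no gradient needed

# One input requires gradients, so the loss does too, without setting anything by hand.
loss = ((w * x) ** 2).sum()
print(loss.requires_grad)       # True

# No input requires gradients, so neither does the result.
print((x * 2).requires_grad)    # False

# The loss is not a leaf, so switching its flag off is rejected.
try:
    loss.requires_grad = False
    flag_change_rejected = False
except RuntimeError:
    flag_change_rejected = True
print(flag_change_rejected)     # True
```

To cut a computed value out of the graph, the usual route is to call detach() on it rather than to change its flag.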

  3. All the built-in PyTorch loss functions are subclasses of _Loss, which is a subclass of nn.Module (see torch/nn/modules/loss.py in the PyTorch source). If you'd like to stick to this convention, you should subclass _Loss when defining your custom loss function. Apart from consistency, one advantage is that, in the PyTorch version this answer was written against, your subclass will raise an AssertionError if you haven't marked your target variables as volatile or requires_grad = False. Another advantage is that you can nest your loss function in nn.Sequential(), because it's an nn.Module. I would recommend this approach for these reasons.
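As an illustration of point 3, here is a sketch of one loss written both as a layer and as a plain function. It subclasses the public nn.Module rather than the private _Loss, which is a deliberate simplification, and the names MeanAbsoluteError and mean_absolute_error are hypothetical. It assumes a recent PyTorch release.

```python
import torch
import torch.nn as nn


class MeanAbsoluteError(nn.Module):
    """Hypothetical custom loss written as a layer."""

    def forward(self, prediction, target):
        return (prediction - target).abs().mean()


def mean_absolute_error(prediction, target):
    """The same loss written as a plain function."""
    return (prediction - target).abs().mean()


model = nn.Linear(4, 1)
criterion = MeanAbsoluteError()

inputs = torch.randn(8, 4)
target = torch.randn(8, 1)  # targets do not require gradients

loss = criterion(model(inputs), target)
loss.backward()  # gradients flow back into model.weight and model.bias
```

Both forms produce the same value and the same gradients; the layer form mainly buys consistency with the built-in losses and the usual nn.Module machinery.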

mbpaulus answered Oct 21 '22 04:10