
What is the backward process of max operation in deep learning?

I know that the backward process of deep learning follows the gradient descent algorithm. However, there is no gradient concept for the max operation.

How do deep learning frameworks like TensorFlow and PyTorch deal with the backward pass of the 'max' operation, as in max pooling?
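
For example (a minimal sketch of what I mean, assuming PyTorch):

import torch

t = torch.rand(5, requires_grad=True)
m = torch.max(t)   # the maximum of the whole tensor
m.backward()

# What does t.grad contain, given that max has no classical derivative?
print(t.grad)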

asked Nov 29 '18 by Ink



1 Answer

You have to think about what the max operator actually does. That is:

  • It returns, or better said, it propagates the maximum.

And that's exactly what it does here - it takes two or more tensors and propagates forward (only) the maximum.

It is often helpful to take a look at a short example:

import torch

t1 = torch.rand(10, requires_grad=True)
t2 = torch.rand(10, requires_grad=True)

s1 = torch.sum(t1)
s2 = torch.sum(t2)
print('sum t1:', s1, 'sum t2:', s2)
m = torch.max(s1, s2)   # propagates only the larger of the two sums
print('max:', m, 'requires_grad:', m.requires_grad)
m.backward()
print('t1 gradients:', t1.grad)
print('t2 gradients:', t2.grad)

This code creates two random tensors, sums each of them up, and puts the two sums through a max function. Then backward() is called on the result.

Let's take a look at the two possible outcomes:

  • Outcome 1 - sum of t1 is larger:

    sum t1: tensor(5.6345) sum t2: tensor(4.3965)
    max: tensor(5.6345) requires_grad: True
    t1 gradients: tensor([ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.])
    t2 gradients: tensor([ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.])
    
  • Outcome 2 - sum of t2 is larger:

    sum t1: tensor(3.3263) sum t2: tensor(4.0517)
    max: tensor(4.0517) requires_grad: True
    t1 gradients: tensor([ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.])
    t2 gradients: tensor([ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.])
    

As you would expect, in the case where s1 represents the maximum, gradients will be calculated for t1. Similarly, when s2 is the maximum, gradients will be calculated for t2.

  • Just like in the forward step, back-propagation propagates backwards through the maximum.

One thing worth mentioning is that the other tensors which do not represent the maximum are still part of the graph; only their gradients are set to zero. If they weren't part of the graph, you would get None as the gradient instead of a zero vector.
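
The same routing also happens within a single tensor when torch.max is used as a reduction along a dimension. As a small additional sketch (not part of the original example), only the entries that actually held the maximum receive a gradient; every other entry gets zero:

import torch

t = torch.rand(3, 4, requires_grad=True)

# Reduction form of torch.max: the maximum of each row plus its index.
values, indices = torch.max(t, dim=1)

values.sum().backward()

# t.grad holds a 1 exactly at the argmax of each row and 0 everywhere else.
print('argmax per row:', indices)
print('t gradients:', t.grad)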

You can also check what happens if you use Python's built-in max instead of torch.max:

t1 = torch.rand(10, requires_grad=True)
t2 = torch.rand(10, requires_grad=True)

s1 = torch.sum(t1)
s2 = torch.sum(t2)
print('sum t1:', s1, 'sum t2:', s2)
m = max(s1, s2)   # built-in max just returns one of the two tensors
print('max:', m, 'requires_grad:', m.requires_grad)
m.backward()
print('t1 gradients:', t1.grad)
print('t2 gradients:', t2.grad)

Output:

sum t1: tensor(4.7661) sum t2: tensor(4.4166)
max: tensor(4.7661) requires_grad: True
t1 gradients: tensor([ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.])
t2 gradients: None 
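
Max pooling follows the same idea, just window by window. As a small sketch beyond the example above (assuming torch.nn.MaxPool2d), the upstream gradient is routed back only to the position that held the maximum of each pooling window; every other input position gets zero:

import torch
import torch.nn as nn

x = torch.rand(1, 1, 4, 4, requires_grad=True)
pool = nn.MaxPool2d(kernel_size=2)

out = pool(x)        # each 2x2 window is reduced to its maximum
out.sum().backward()

# x.grad contains a 1 at the winning position of each window and 0 elsewhere.
print('input gradients:', x.grad)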
answered Oct 21 '22 by MBT