I know that the backward pass in deep learning follows the gradient descent algorithm. However, there is no notion of a gradient for the max operation. How do deep learning frameworks like TensorFlow and PyTorch handle the backward pass of 'max' operations such as max pooling?
Back-propagation is the process of calculating the derivatives, and gradient descent is the process of descending along the gradient, i.e. adjusting the parameters of the model to move down the loss function.
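To make the distinction concrete, here is a minimal sketch (the toy loss and the variable names are illustrative, not part of the original post): backward() fills in the gradient, and the parameter update afterwards is the gradient descent step.

import torch

w = torch.tensor(2.0, requires_grad=True)
loss = (w - 5.0) ** 2        # a toy loss function
loss.backward()              # back-propagation: computes d(loss)/dw into w.grad
with torch.no_grad():
    w -= 0.1 * w.grad        # gradient descent: one step down the loss surface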
Regarding the max operation: you have to think about what the max operator actually does. That is: it returns, or rather propagates forward, the maximum of its inputs. And that's exactly what it does here - it takes two or more tensors and propagates forward (only) the maximum.
It is often helpful to take a look at a short example:
import torch

t1 = torch.rand(10, requires_grad=True)
t2 = torch.rand(10, requires_grad=True)

s1 = torch.sum(t1)
s2 = torch.sum(t2)
print('sum t1:', s1, 'sum t2:', s2)

m = torch.max(s1, s2)  # only the larger of the two sums is propagated forward
print('max:', m, 'requires_grad:', m.requires_grad)

m.backward()  # gradients flow back only through the tensor that won the max
print('t1 gradients:', t1.grad)
print('t2 gradients:', t2.grad)
This code creates two random tensors, sums each of them up, and puts the two sums through a max function. Then backward() is called on the result. Let's take a look at the two possible outcomes:
Outcome 1 - sum of t1 is larger:
sum t1: tensor(5.6345) sum t2: tensor(4.3965)
max: tensor(5.6345) requires_grad: True
t1 gradients: tensor([ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])
t2 gradients: tensor([ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
Outcome 2 - sum of t2 is larger:
sum t1: tensor(3.3263) sum t2: tensor(4.0517)
max: tensor(4.0517) requires_grad: True
t1 gradients: tensor([ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
t2 gradients: tensor([ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])
As you would expect, in the case where s1 represents the maximum, gradients are calculated for t1. Similarly, when s2 is the maximum, gradients are calculated for t2. One thing worth mentioning is that the other tensor, which does not represent the maximum, is still part of the graph; only its gradients are set to zero. If it were not part of the graph, you would get None as its gradient instead of a zero vector.
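Since the original question was about max pooling, here is a minimal sketch (not part of the original example) showing the same behaviour with torch.nn.MaxPool2d: the gradient flows back only to the position that won each pooling window, and every other position receives zero.

import torch

x = torch.rand(1, 1, 4, 4, requires_grad=True)  # a single 4x4 feature map
pool = torch.nn.MaxPool2d(kernel_size=2)        # 2x2 max pooling, stride 2
y = pool(x)                                     # shape (1, 1, 2, 2)
y.sum().backward()
print(x.grad)  # 1.0 exactly at the argmax of each 2x2 window, 0.0 everywhere else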
You can check what happens if you use Python's built-in max instead of torch.max:
import torch

t1 = torch.rand(10, requires_grad=True)
t2 = torch.rand(10, requires_grad=True)

s1 = torch.sum(t1)
s2 = torch.sum(t2)
print('sum t1:', s1, 'sum t2:', s2)

m = max(s1, s2)  # Python's built-in max instead of torch.max
print('max:', m, 'requires_grad:', m.requires_grad)

m.backward()
print('t1 gradients:', t1.grad)
print('t2 gradients:', t2.grad)
Output:
sum t1: tensor(4.7661) sum t2: tensor(4.4166)
max: tensor(4.7661) requires_grad: True
t1 gradients: tensor([ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])
t2 gradients: None
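The reason is that Python's built-in max just compares the two scalars and returns the winning tensor object itself, so the other tensor is never connected to the graph of m at all, which is why its gradient stays None rather than becoming a zero vector. A quick way to check this, using the variables from the snippet above:

print(m is s1 or m is s2)  # True: max() returned one of the existing tensors unchanged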