 

Backpropagation algorithm through cross-channel local response normalization (LRN) layer

I am working on replicating a neural network. I'm trying to get an understanding of how the standard layer types work. In particular, I'm having trouble finding a description anywhere of how cross-channel normalisation layers behave on the backward-pass.

Since the normalization layer has no parameters, I could guess two possible options:

  1. The error gradients from the next (i.e. later) layer are passed backwards without doing anything to them.

  2. The error gradients are normalized in the same way the activations are normalized across channels in the forward pass.

I can't think of an intuitive reason to prefer one over the other, which is why I'd like some help on this.

EDIT1:

The layer is a standard layer in Caffe, as described here http://caffe.berkeleyvision.org/tutorial/layers.html (see 'Local Response Normalization (LRN)').

The layer's forward pass is described in section 3.3 of the AlexNet paper: http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf
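
For reference, the forward pass from section 3.3 of that paper, at a fixed spatial position and with N channels in total, is:

b_i = a_i / (k + alpha * sum(a_j ^ 2)) ^ beta

where the sum over j runs from max(0, i - n/2) to min(N - 1, i + n/2).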

EDIT2:

I believe the forward and backward pass algorithms are described in both the Torch library here: https://github.com/soumith/cudnn.torch/blob/master/SpatialCrossMapLRN.lua

and in the Caffe library here: https://github.com/BVLC/caffe/blob/master/src/caffe/layers/lrn_layer.cpp

Could anyone who is familiar with either/both of these translate the method for the backward pass into plain English?

asked Nov 18 '15 by user1488804


People also ask

What is an LRN layer?

LRN is a non-trainable layer that square-normalizes the values in a feature map within a local neighborhood. There are two types of LRN, depending on how the neighborhood is defined: inter-channel LRN, which is what the AlexNet paper originally used, and intra-channel LRN.

What is cross channel normalization layer?

The cross-channel normalization operation uses local responses in different channels to normalize each activation. Cross-channel normalization typically follows a relu operation. Cross-channel normalization is also known as local response normalization.

Why do we normalize layers?

Layer normalization normalizes each of the inputs in the batch independently across all features. As batch normalization is dependent on batch size, it's not effective for small batch sizes. Layer normalization is independent of the batch size, so it can be applied to batches with smaller sizes as well.


1 Answer

The backward pass uses the chain rule to propagate the gradient through the local response normalization layer. In this sense it is similar to a nonlinearity layer, which also has no trainable parameters of its own but still affects the gradients flowing backwards.

From the Caffe code you linked to, I see that they take the error at each neuron as a parameter and compute the error for the previous layer as follows:

First, on the forward pass they cache a so-called scale, which is computed (in the notation of the AlexNet paper, see the formula in section 3.3) as:

scale_i = k + alpha / n * sum(a_j ^ 2)

Here and below, the sum is indexed by j and runs from max(0, i - n/2) to min(N - 1, i + n/2), where N is the number of channels.

(Note that the paper does not normalize by n, so I assume this is something Caffe does differently from AlexNet.) The forward pass is then computed as b_i = a_i * scale_i ^ -beta.
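
A rough sketch of that forward pass (a hypothetical NumPy helper, not Caffe's actual code; it treats a as a 1-D array of the N channel activations at one spatial position and uses the Caffe-style alpha / n scaling described above):

    import numpy as np

    def lrn_forward(a, n=5, k=1.0, alpha=1e-4, beta=0.75):
        # Cross-channel LRN at a single spatial position.
        # a: 1-D array of length N (one activation per channel).
        N = a.shape[0]
        scale = np.empty(N)
        for i in range(N):
            lo = max(0, i - n // 2)
            hi = min(N, i + n // 2 + 1)   # exclusive bound, i.e. j <= min(N - 1, i + n/2)
            scale[i] = k + (alpha / n) * np.sum(a[lo:hi] ** 2)
        b = a * scale ** (-beta)          # b_i = a_i * scale_i ^ -beta
        return b, scale                   # scale is cached for the backward pass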

To backward propagate the error, let's say that the error coming from the next layer is be_i, and the error that we need to compute is ae_i. Then ae_i is computed as:

ae_i = scale_i ^ -beta * be_i - (2 * alpha * beta / n) * a_i * sum(be_j * b_j / scale_j)
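
The same computation as a sketch (a hypothetical lrn_backward, continuing the NumPy example above; be is the error dL/db coming from the next layer and the returned ae is dL/da):

    def lrn_backward(a, b, scale, be, n=5, alpha=1e-4, beta=0.75):
        # a, b and scale come from the forward pass; be is the upstream error.
        N = a.shape[0]
        ratio = be * b / scale            # be_j * b_j / scale_j
        ae = np.empty(N)
        for i in range(N):
            lo = max(0, i - n // 2)
            hi = min(N, i + n // 2 + 1)
            ae[i] = scale[i] ** (-beta) * be[i] \
                    - (2.0 * alpha * beta / n) * a[i] * np.sum(ratio[lo:hi])
        return ae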

Since you are planning to implement it manually, I will also share two tricks that Caffe uses in its code that make the implementation simpler:

  1. When you compute the addends for the sum, allocate an array of size N + n - 1, and pad it with n/2 zeros on each end. This way you can compute the sum from i - n/2 to i + n/2, without caring about going below zero and beyond N.

  2. You don't need to recompute the sum on each iteration. Instead, compute the addends in advance (a_j ^ 2 for the forward pass, be_j * b_j / scale_j for the backward pass), then compute the sum for i = 0, and for each consecutive i just add addend[i + n/2] and subtract addend[i - n/2 - 1]. This gives you the value of the sum for the new value of i in constant time (see the sketch after this list).
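
Here is a rough sketch of both tricks combined (a hypothetical helper, assuming an odd window size n; not Caffe's actual code):

    def padded_sliding_sums(addends, n=5):
        # addends: a_j ^ 2 for the forward pass, or be_j * b_j / scale_j for the backward pass.
        N = addends.shape[0]
        padded = np.zeros(N + n - 1)          # trick 1: n/2 zeros of padding on each end
        padded[n // 2 : n // 2 + N] = addends
        sums = np.empty(N)
        window = np.sum(padded[:n])           # the sum for i = 0
        sums[0] = window
        for i in range(1, N):                 # trick 2: slide the window in constant time
            window += padded[i + n - 1] - padded[i - 1]
            sums[i] = window
        return sums

For the forward pass you would multiply the returned sums by alpha / n and add k to get scale; for the backward pass you would plug them into the sum term of the ae_i formula above.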

answered by Ishamael