
Batch normalization in neural networks

I'm still fairly new to ANNs and I was just reading the Batch Normalization paper (http://arxiv.org/pdf/1502.03167.pdf), but I'm not sure I understand what they are doing (and, more importantly, why it works).

So let's say I have two layers L1 and L2, where L1 produces outputs and sends them to the neurons in L2. Does batch normalization just take all the outputs from L1 (i.e. every single output from every single neuron, giving an overall vector of |L1| × |L2| numbers for a fully connected network), normalize them to have a mean of 0 and an SD of 1, and then feed them to their respective neurons in L2 (plus applying the linear transformation with the gamma and beta they discuss in the paper)?
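In code, here is roughly what I think the transform looks like for one mini-batch (a minimal NumPy sketch of my reading of the paper; the function name and the eps constant are mine, not from the paper):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # x: activations coming out of L1 for one mini-batch, shape (batch_size, |L1|)
    mu = x.mean(axis=0)                    # per-activation mean over the mini-batch
    var = x.var(axis=0)                    # per-activation variance over the mini-batch
    x_hat = (x - mu) / np.sqrt(var + eps)  # each activation now has mean 0, SD ~1
    return gamma * x_hat + beta            # the learnable scale/shift from the paper
```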

If this is indeed the case, how does this help the NN? What's so special about a constant distribution?

asked Apr 30 '15 by WhiteTiger

People also ask

What is batch normalization in a convolutional neural network?

Batch Norm is a normalization technique applied between the layers of a neural network rather than to the raw input data. It is computed over mini-batches instead of the full data set. It speeds up training and allows higher learning rates, making learning easier.

How does batch normalization work?

Batch Norm is just another network layer that gets inserted between a hidden layer and the next hidden layer. Its job is to take the outputs from the first hidden layer and normalize them before passing them on as the input of the next hidden layer. Just like the weights and biases of other layers, its own parameters (the scale gamma and the shift beta) are learned during training.
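For illustration, inserting such a layer in a small fully connected network might look like this (a PyTorch sketch added here for concreteness; the layer sizes are arbitrary and not from the original question):

```python
import torch.nn as nn

# A small fully connected net with Batch Norm inserted between layers,
# following the paper's placement: linear transform -> BN -> nonlinearity.
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.BatchNorm1d(256),   # normalizes the 256 activations over each mini-batch
    nn.ReLU(),
    nn.Linear(256, 10),
)
```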

When should I use batch normalization?

We can use Batch Normalization in convolutional neural networks, recurrent neural networks, and plain feed-forward networks. In practice, it is usually inserted between a layer's linear transformation and its activation function (the placement used in the original paper), although some practitioners place it after the activation instead.

What is normalization in a neural network?

Normalization helps the training of neural networks because it puts the different features on a similar scale, which stabilizes the gradient descent step and lets us use larger learning rates, or converge faster for a given learning rate.
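As a toy illustration of that point (the data and feature names are made up, not from the question), standardizing input features puts values measured in very different units on the same scale:

```python
import numpy as np

# Two input features on very different scales: age in years, income in dollars.
X = np.array([[25.0,  50_000.0],
              [40.0,  90_000.0],
              [33.0, 120_000.0]])

# Standardize each feature to zero mean and unit variance so neither feature
# dominates the gradient simply because of its units.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
```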


1 Answer

During standard SGD training of a network, the distribution of inputs to a hidden layer changes over time, because the hidden layer before it is constantly changing as well. This is what the paper calls internal covariate shift, and it can be a problem.

It is known that neural networks converge faster if the training data is "whitened", that is, transformed in such a way that each component has zero mean and unit variance and is decorrelated from the other components. See the papers (LeCun et al., 1998b) and (Wiesler & Ney, 2011) cited in the paper.

The authors' idea is to apply this kind of normalization not only to the network's input, but to the input of every intermediate layer as well. Full whitening computed over the entire training set would be far too expensive, so they make two simplifications: each activation is normalized independently (to zero mean and unit variance, without decorrelation), and the statistics are estimated per mini-batch rather than over the whole dataset. They claim that this vastly speeds up training and also acts as a sort of regularization.
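To make the batch-wise part concrete, here is a rough NumPy sketch of a single Batch Norm layer at training versus inference time (my own simplification of the paper's algorithm; the names and the running-average momentum are illustrative, and backpropagation/learning of gamma and beta is omitted):

```python
import numpy as np

class BatchNorm:
    """Minimal sketch of batch-wise normalization for one hidden layer."""

    def __init__(self, num_features, eps=1e-5, momentum=0.1):
        self.gamma = np.ones(num_features)    # learned scale
        self.beta = np.zeros(num_features)    # learned shift
        self.eps = eps
        self.momentum = momentum
        # Running estimates used at inference time instead of batch statistics.
        self.running_mean = np.zeros(num_features)
        self.running_var = np.ones(num_features)

    def forward(self, x, training=True):
        if training:
            mu = x.mean(axis=0)               # statistics of this mini-batch only
            var = x.var(axis=0)
            self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * mu
            self.running_var = (1 - self.momentum) * self.running_var + self.momentum * var
        else:
            mu, var = self.running_mean, self.running_var
        x_hat = (x - mu) / np.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta
```

At inference time the mini-batch statistics are replaced by the accumulated running estimates, which plays the role of the population statistics used in the paper.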

answered Sep 23 '22 by cfh