
Ordering of batch normalization and dropout?


In Ioffe and Szegedy (2015), the authors state that "we would like to ensure that for any parameter values, the network always produces activations with the desired distribution". So the Batch Normalization layer is inserted right after a Conv or Fully Connected layer, but before the ReLU (or any other) activation. See this video at around the 53-minute mark for more details.

As far as dropout goes, I believe it is applied after the activation layer. In Figure 3b of the dropout paper, the dropout mask r(l) for hidden layer l (a vector of Bernoulli random variables) is applied to y(l), where y(l) is the output of the activation function f.
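For reference, the feed-forward step of a dropout network can be written out as follows. This is a sketch in the notation of the dropout paper as I recall it, where p is the probability of keeping a unit and * is an element-wise product; check the paper for the exact form.

```latex
\begin{aligned}
r^{(l)}_j &\sim \mathrm{Bernoulli}(p) \\
\tilde{y}^{(l)} &= r^{(l)} * y^{(l)} \\
z^{(l+1)}_i &= w^{(l+1)}_i \, \tilde{y}^{(l)} + b^{(l+1)}_i \\
y^{(l+1)}_i &= f\left(z^{(l+1)}_i\right)
\end{aligned}
```

The mask multiplies y(l), which is already the output of f from the previous layer, so dropout sits after the activation.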

So in summary, the order of using batch normalization and dropout is:

-> CONV/FC -> BatchNorm -> ReLU (or other activation) -> Dropout -> CONV/FC ->
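A minimal pure-Python sketch of one such block in training mode, to make the order concrete. The helper names (`linear`, `batch_norm`, `relu`, `dropout`, `block`) are made up for illustration, the batch norm omits the learned scale and shift, and the dropout is the "inverted" variant that rescales kept units; none of this is taken from a specific library.

```python
import math
import random

def linear(batch, weights, bias):
    """FC layer: one output per row of `weights`."""
    return [[sum(w * x for w, x in zip(row, sample)) + b
             for row, b in zip(weights, bias)]
            for sample in batch]

def batch_norm(batch, eps=1e-5):
    """Normalize each feature over the batch (training mode, no learned scale/shift)."""
    n = len(batch)
    features = list(zip(*batch))
    means = [sum(col) / n for col in features]
    variances = [sum((v - m) ** 2 for v in col) / n
                 for col, m in zip(features, means)]
    return [[(v - m) / math.sqrt(var + eps)
             for v, m, var in zip(sample, means, variances)]
            for sample in batch]

def relu(batch):
    return [[max(0.0, v) for v in sample] for sample in batch]

def dropout(batch, rate, rng):
    """Inverted dropout: zero each unit with probability `rate`, rescale the rest."""
    keep = 1.0 - rate
    return [[v / keep if rng.random() < keep else 0.0 for v in sample]
            for sample in batch]

def block(batch, weights, bias, rate, rng):
    # CONV/FC -> BatchNorm -> activation -> Dropout
    return dropout(relu(batch_norm(linear(batch, weights, bias))), rate, rng)

rng = random.Random(0)
x = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(8)]  # batch of 8, 3 features
w = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(4)]  # 4 output units
b = [0.0] * 4
out = block(x, w, b, rate=0.2, rng=rng)
```

Swapping the order of the calls inside `block` is all it takes to try a different arrangement.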


As noted in the comments, an amazing resource to read up on the order of layers is here. I have gone through the comments there, and it is the best resource on the topic I have found on the internet.

My 2 cents:

Dropout is meant to block information from certain neurons completely, to make sure the neurons do not co-adapt. So batch normalization has to come after dropout; otherwise you are passing information through the normalization statistics.

If you think about it, this is the same reason why, in typical ML problems, we don't compute the mean and standard deviation over the entire dataset and then split it into train, validation and test sets. We split first, then compute the statistics over the train set and use them to center and normalize the validation and test sets.
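A small sketch of that split-then-normalize habit, using only the standard library. The data is synthetic and the 700/150/150 split is an arbitrary choice for illustration.

```python
import random
import statistics

rng = random.Random(0)
data = [rng.gauss(5.0, 2.0) for _ in range(1000)]  # synthetic feature values

# Split first ...
train, val, test = data[:700], data[700:850], data[850:]

# ... then compute the statistics on the training split only.
mean = statistics.fmean(train)
std = statistics.pstdev(train)

def standardize(values):
    """Center and scale with the *training* statistics."""
    return [(v - mean) / std for v in values]

train_z = standardize(train)
val_z = standardize(val)
test_z = standardize(test)
```

Only `train_z` is guaranteed to have zero mean and unit standard deviation; the validation and test sets are merely close, which is the point: nothing about them leaked into `mean` and `std`.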

So I suggest Scheme 1 (this takes pseudomarvin's comment on the accepted answer into consideration):

-> CONV/FC -> ReLU (or other activation) -> Dropout -> BatchNorm -> CONV/FC ->

as opposed to Scheme 2 from the accepted answer:

-> CONV/FC -> BatchNorm -> ReLU (or other activation) -> Dropout -> CONV/FC ->

Please note that, by this reasoning, a network under Scheme 2 should overfit more than a network under Scheme 1. However, the OP ran some tests, as mentioned in the question, and they support Scheme 2.


Usually, just drop the Dropout (when you have BN):

  • "BN eliminates the need for Dropout in some cases because BN intuitively provides similar regularization benefits as Dropout"
  • "Architectures like ResNet, DenseNet, etc. do not use Dropout"

For more details, refer to this paper [Understanding the Disharmony between Dropout and Batch Normalization by Variance Shift] as already mentioned by @Haramoz in the comments.


Conv - Activation - DropOut - BatchNorm - Pool --> Test_loss: 0.04261355847120285

Conv - Activation - DropOut - Pool - BatchNorm --> Test_loss: 0.050065308809280396

Conv - Activation - BatchNorm - Pool - DropOut --> Test_loss: 0.04911309853196144

Conv - Activation - BatchNorm - DropOut - Pool --> Test_loss: 0.06809622049331665

Conv - BatchNorm - Activation - DropOut - Pool --> Test_loss: 0.038886815309524536

Conv - BatchNorm - Activation - Pool - DropOut --> Test_loss: 0.04126095026731491

Conv - BatchNorm - DropOut - Activation - Pool --> Test_loss: 0.05142546817660332

Conv - DropOut - Activation - BatchNorm - Pool --> Test_loss: 0.04827788099646568

Conv - DropOut - Activation - Pool - BatchNorm --> Test_loss: 0.04722036048769951

Conv - DropOut - BatchNorm - Activation - Pool --> Test_loss: 0.03238215297460556


Trained on the MNIST dataset (20 epochs) with 2 convolutional modules (see below), each time followed by

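# Classifier head; assumes `layers` is the Keras layers module and `model` is a Sequential model.
model.add(layers.Flatten())
model.add(layers.Dense(512, activation="elu"))
model.add(layers.Dense(10, activation="softmax"))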

The convolutional layers have a kernel size of (3,3) and default padding; the activation is elu. The pooling is a MaxPooling with a pool size of (2,2). The loss is categorical_crossentropy and the optimizer is adam.

The dropout probability is 0.2 in the first module and 0.3 in the second. The number of feature maps is 32 and 64, respectively.

Edit: When I dropped the Dropout, as recommended in some answers, the model converged faster but generalized worse than when I used both BatchNorm and Dropout.


I found a paper that explains the disharmony between Dropout and Batch Norm (BN). The key idea is what they call the "variance shift": dropout behaves differently in the training and testing phases, which shifts the input statistics that BN learns. The main idea is illustrated in a figure taken from the paper.
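The effect is easy to reproduce on a single zero-mean feature. The sketch below is my own toy illustration, not code from the paper: it assumes inverted dropout (kept units are scaled by 1/(1 - rate) during training, and dropout is the identity at inference), and `dropout_train` / `dropout_eval` are hypothetical helper names.

```python
import random
import statistics

def dropout_train(values, rate, rng):
    """Inverted dropout as applied at training time."""
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]

def dropout_eval(values):
    """At inference time dropout passes values through unchanged."""
    return list(values)

rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in range(20_000)]  # unit-variance activations
rate = 0.5

var_train = statistics.pvariance(dropout_train(x, rate, rng))
var_eval = statistics.pvariance(dropout_eval(x))

print(f"variance seen in training:  {var_train:.3f}")
print(f"variance seen at inference: {var_eval:.3f}")
```

For zero-mean inputs, the training-time variance is about 1/(1 - rate) times the inference-time variance. A BN layer placed after the dropout accumulates its running variance from the training-time values, so at inference it normalizes with a variance that no longer matches its input.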

A small demo for this effect can be found in this notebook.


I read the recommended papers in the answer and comments from https://stackoverflow.com/a/40295999/8625228

Ioffe and Szegedy (2015) suggest using only BN in the network structure. Li et al. (2018) give statistical and experimental analyses showing that there is a variance shift when practitioners use Dropout before BN. Thus, Li et al. (2018) recommend applying Dropout after all BN layers.

In Ioffe and Szegedy (2015), BN is placed before the activation function. However, Chen et al. (2019) use an IC layer, which combines Dropout and BN, and recommend using BN after ReLU.

To be on the safe side, I use only one of Dropout or BN in a network.

Chen, Guangyong, Pengfei Chen, Yujun Shi, Chang-Yu Hsieh, Benben Liao, and Shengyu Zhang. 2019. “Rethinking the Usage of Batch Normalization and Dropout in the Training of Deep Neural Networks.” CoRR abs/1905.05928. http://arxiv.org/abs/1905.05928.

Ioffe, Sergey, and Christian Szegedy. 2015. “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.” CoRR abs/1502.03167. http://arxiv.org/abs/1502.03167.

Li, Xiang, Shuo Chen, Xiaolin Hu, and Jian Yang. 2018. “Understanding the Disharmony Between Dropout and Batch Normalization by Variance Shift.” CoRR abs/1801.05134. http://arxiv.org/abs/1801.05134.