For a CNN architecture I want to use a SpatialDropout2D layer instead of a Dropout layer. Additionally, I want to use BatchNormalization. So far I have always placed BatchNormalization directly after a convolutional layer but before the activation function, as mentioned in the paper by Ioffe and Szegedy. The dropout layers I have always placed after the MaxPooling2D layer.
In https://machinelearningmastery.com/how-to-reduce-overfitting-with-dropout-regularization-in-keras/, SpatialDropout2D is placed directly after the convolutional layer.
I now find it rather confusing in which order I should apply these layers. I had also read on a Keras page that SpatialDropout should be placed directly after the Conv layer (but I can't find that page anymore).
Is the following order correct?
ConvLayer - SpatialDropout - BatchNormalization - Activation function - MaxPooling
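Expressed as a minimal Keras sketch (the filter count, dropout rate and input shape below are just placeholders), that order would look like this:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, SpatialDropout2D, BatchNormalization, Activation, MaxPooling2D

# The order I am asking about: Conv -> SpatialDropout -> BatchNorm -> Activation -> MaxPooling
model = Sequential()
model.add(Conv2D(32, (3, 3), input_shape=(32, 32, 3)))  # placeholder input shape
model.add(SpatialDropout2D(0.2))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))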
I am really hoping for tips, and thank you in advance.
Update: My goal was actually to replace the dropout with spatial dropout in the following CNN architecture:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, BatchNormalization, Activation, MaxPooling2D, Dropout, Flatten, Dense

model = Sequential()
model.add(Conv2D(32, (3,3)))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Conv2D(32, (3,3)))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2,2)))
model.add(Dropout(0.2))
model.add(Conv2D(64, (3,3)))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Conv2D(64, (3,3)))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2,2)))
model.add(Dropout(0.2))
model.add(Flatten())
model.add(Dense(512))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Dropout(0.4))
model.add(Dense(10))
model.add(Activation('softmax'))
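For the first block of the model above, the literal swap I have in mind would read something like this (SpatialDropout2D comes from tensorflow.keras.layers; the 0.2 rate is simply carried over from the Dropout it replaces):
from tensorflow.keras.layers import SpatialDropout2D

model.add(Conv2D(32, (3,3)))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Conv2D(32, (3,3)))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2,2)))
model.add(SpatialDropout2D(0.2))  # was: Dropout(0.2)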
In practical coding, Batch Normalization is added either directly before a layer's activation function or directly after it. Most practitioners have reported good results when implementing Batch Normalization after the activation layer.
Typically, dropout is applied after the non-linear activation function. However, when using rectified linear units (ReLUs), it might make sense to apply dropout before the non-linearity for reasons of computational efficiency, depending on the particular implementation.
In Andrew Ng's Coursera course, he recommends performing batch norm before the ReLU, which is the popular practice. I don't see why it's not better after. Technically, batch norm can normalize to any mean and variance, so it shouldn't matter; but isn't it easier to normalize after, since we want the activations to have variance 1?
We can add batch normalization to our model in the same way as adding a Dense layer. BatchNormalization() normalizes the activations of the previous layer over each batch and, by default, uses a momentum of 0.99.
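As a minimal sketch (assuming tf.keras; the layer size and input shape are placeholders), it is added like any other layer, and the arguments below just restate those defaults:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, BatchNormalization, Activation

model = Sequential()
model.add(Dense(64, input_shape=(20,)))  # placeholder layer size and input shape
model.add(BatchNormalization(momentum=0.99, epsilon=1e-3))  # tf.keras defaults
model.add(Activation('relu'))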
There is a big problem that appears when you mix these layers, especially when BatchNormalization is right after Dropout.
Dropout tries to keep the same mean of the outputs as without dropout, but it does change the standard deviation, which will cause a huge difference in the BatchNormalization between training and validation. (During training, the BatchNormalization receives the changed standard deviations, accumulates them and stores them. During validation, the dropouts are turned off, so the standard deviation is not a changed one anymore, but the original one. But BatchNormalization, because it's in validation, will not use the batch statistics, but the stored statistics, which will be very different from the batch statistics.)
So, the first and most important rule is: don't place a BatchNormalization after a Dropout (or a SpatialDropout).
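To make the rule concrete, here is a minimal sketch (filter counts, rates and the input shape are placeholders) of the ordering to avoid versus a safer one:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, BatchNormalization, Activation, MaxPooling2D, Dropout

# Ordering to avoid: the BatchNormalization sees dropout-distorted statistics
# during training but undistorted activations at inference time.
avoid = Sequential([
    Conv2D(32, (3, 3), input_shape=(32, 32, 3)),
    Activation('relu'),
    Dropout(0.2),
    BatchNormalization(),  # right after Dropout: train/inference statistics mismatch
])

# Safer: normalize before the activation and keep the dropout after the pooling,
# with no BatchNormalization immediately following it.
safer = Sequential([
    Conv2D(32, (3, 3), input_shape=(32, 32, 3)),
    BatchNormalization(),
    Activation('relu'),
    MaxPooling2D((2, 2)),
    Dropout(0.2),
])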
Usually, I try to leave at least two convolutional/dense layers without any dropout before applying a batch normalization, to avoid this.
Also important: the role of the Dropout is to "zero" the influence of some of the weights of the next layer. If you apply a normalization after the dropout, you will not have "zeros" anymore, but a certain value that will be repeated for many units. And this value will vary from batch to batch. So, although there is noise added, you are not killing units as a pure dropout is supposed to do.
The problem of using a regular Dropout before a MaxPooling is that you will zero some pixels, and then the MaxPooling will take the maximum value, sort of ignoring part of your dropout. If your dropout happens to hit a maximum pixel, then the pooling will result in the second maximum, not in zero.
So, Dropout before MaxPooling reduces the effectiveness of the dropout.
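A tiny numerical example of that, for a single 2x2 pooling window (dropout's rescaling of the surviving pixels is ignored here for simplicity):
import numpy as np

window = np.array([[0.9, 0.7],
                   [0.3, 0.1]])

# Regular dropout happens to zero the maximum pixel...
dropped = window.copy()
dropped[0, 0] = 0.0

# ...but max pooling simply picks the next-largest value instead of a zero.
print(window.max())   # 0.9
print(dropped.max())  # 0.7 -> the dropout is largely ignored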
But a SpatialDropout never hits "pixels"; it only hits channels. When it hits a channel, it will zero all pixels for that channel, and thus the MaxPooling will effectively result in zero too.
So, there is no difference between spatial dropout before or after the pooling. An entire "channel" will be zero in both orders.
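A quick sketch showing this with tf.keras (the feature-map size and rate are placeholders; training=True forces the dropout to be active outside of fit()):
import tensorflow as tf

x = tf.random.uniform((1, 4, 4, 3))  # one 4x4 feature map with 3 channels

# SpatialDropout2D drops entire channels during training.
dropped = tf.keras.layers.SpatialDropout2D(rate=0.5)(x, training=True)
pooled = tf.keras.layers.MaxPooling2D((2, 2))(dropped)

# Any dropped channel is all-zero both before and after pooling, so pooling
# cannot "undo" a spatial dropout the way it can with a regular dropout.
print(tf.reduce_max(tf.abs(dropped), axis=[1, 2]))  # zeros mark dropped channels
print(tf.reduce_max(tf.abs(pooled), axis=[1, 2]))   # the same channels stay zero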
Depending on the activation function, using a batch normalization before it can be a good advantage.
For a 'relu' activation, the normalization makes the model fail-safe against the bad-luck case of "all zeros freeze a relu layer". It will also tend to guarantee that half of the units will be zero and the other half linear.
For a 'sigmoid' or a 'tanh', the BatchNormalization will guarantee that the values are within a healthy range, avoiding saturation and vanishing gradients (values that are too far from zero will hit an almost flat region of these functions, causing vanishing gradients).
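A rough numerical illustration of the saturation point, using the fact that the tanh gradient is 1 - tanh(x)^2 (the scale of 10 is just an exaggerated example of badly scaled pre-activations):
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=10.0, size=1000)  # badly scaled pre-activations

def tanh_grad(v):
    return 1.0 - np.tanh(v) ** 2

normed = (x - x.mean()) / x.std()  # roughly what BatchNormalization does

print(tanh_grad(x).mean())       # close to 0: most units sit in the flat regions
print(tanh_grad(normed).mean())  # much larger: gradients can still flow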
There are people who say there are other advantages if you do the contrary; I'm not fully aware of those advantages, but I like the ones I mentioned very much.
With 'relu', there is no difference; it can be proved that the results are exactly the same.
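A quick check of that claim, applying the same inverted-dropout mask before and after the relu:
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=(4, 6))
mask = (rng.random(x.shape) > 0.3) / 0.7  # inverted-dropout mask with its rescaling

relu = lambda v: np.maximum(v, 0.0)

# Zeroing and positive rescaling commute with max(., 0), so dropping before or
# after the relu gives exactly the same result.
print(np.allclose(relu(x * mask), relu(x) * mask))  # True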
With activations that are not centered, such as 'sigmoid', putting a dropout before the activation will not result in "zeros", but in other values. For a sigmoid, the final result of a dropout placed before it would be 0.5.
If you add a 'tanh' after a dropout, for instance, you will have the zeros, but the scaling that dropout applies to keep the same mean will be distorted by the tanh. (I don't know if this is a big problem, but it might be.)
I don't see much of a difference here: if the activation is not very weird, the final result would be the same.
There are possibilities, but some are troublesome. I find the following order a good one and often use it; I would do something like the sketch below.
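A minimal sketch of such an ordering (the filter counts, rates and input shape are placeholders; the key points are BatchNormalization before the activation, the spatial dropout after the pooling, and no BatchNormalization immediately after a dropout):
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Conv2D, BatchNormalization, Activation,
                                     MaxPooling2D, SpatialDropout2D, Dropout,
                                     Flatten, Dense)

model = Sequential()

# Block 1: BatchNorm before the activation, spatial dropout after the pooling.
model.add(Conv2D(32, (3, 3), input_shape=(32, 32, 3)))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(SpatialDropout2D(0.2))

# Block 2: no BatchNormalization right after the previous dropout.
model.add(Conv2D(64, (3, 3)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(SpatialDropout2D(0.2))

# Classifier head: again, no BatchNormalization directly after the dropout.
model.add(Flatten())
model.add(Dense(128))
model.add(Activation('relu'))
model.add(Dropout(0.4))
model.add(Dense(10))
model.add(Activation('softmax'))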