For a CNN architecture I want to use a SpatialDropout2D layer instead of a Dropout layer. Additionally, I want to use BatchNormalization. So far I have always placed BatchNormalization directly after a convolutional layer but before the activation function, as mentioned in the paper by Ioffe and Szegedy. The dropout layers I have always placed after the MaxPooling2D layer.
In https://machinelearningmastery.com/how-to-reduce-overfitting-with-dropout-regularization-in-keras/, SpatialDropout2D is placed directly after the convolutional layer.
I now find it rather confusing in which order I should apply these layers. I had also read on a Keras page that SpatialDropout should be placed directly after the Conv layer (but I can't find that page anymore).
Is the following order correct?
ConvLayer - SpatialDropout - BatchNormalization - Activation function - MaxPooling
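Expressed as a minimal Keras sketch (the filter count, dropout rate and input shape below are just placeholders), that order would look like this:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, SpatialDropout2D, BatchNormalization, Activation, MaxPooling2D

# The order I am asking about: Conv -> SpatialDropout -> BatchNorm -> Activation -> MaxPooling
model = Sequential()
model.add(Conv2D(32, (3, 3), input_shape=(32, 32, 3)))  # placeholder input shape
model.add(SpatialDropout2D(0.2))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))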
I am really hoping for tips, and thank you in advance.
Update: My goal was actually to replace the dropout with spatial dropout in the following CNN architecture:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, BatchNormalization, Activation, MaxPooling2D, Dropout, Flatten, Dense

model = Sequential()
model.add(Conv2D(32, (3,3)))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Conv2D(32, (3,3)))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2,2)))
model.add(Dropout(0.2))
model.add(Conv2D(64, (3,3)))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Conv2D(64, (3,3)))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2,2)))
model.add(Dropout(0.2))
model.add(Flatten())
model.add(Dense(512))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Dropout(0.4))
model.add(Dense(10))
model.add(Activation('softmax'))
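For the first block of the model above, the literal swap I have in mind would read something like this (SpatialDropout2D comes from tensorflow.keras.layers; the 0.2 rate is simply carried over from the Dropout it replaces):
from tensorflow.keras.layers import SpatialDropout2D

model.add(Conv2D(32, (3,3)))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Conv2D(32, (3,3)))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2,2)))
model.add(SpatialDropout2D(0.2))  # was: Dropout(0.2)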
In practical coding, Batch Normalization is added either directly before a layer's activation function or directly after it. Most practitioners have reported good results when implementing Batch Normalization after the activation layer.
Typically, dropout is applied after the non-linear activation function. However, when using rectified linear units (ReLUs), it might make sense to apply dropout before the non-linearity for reasons of computational efficiency, depending on the particular implementation.
In Andrew Ng's Coursera course, he recommends performing batch norm before the ReLU, which is the popular practice. I don't see why it's not better after. Technically, batch norm can normalize to any mean and variance, so it shouldn't matter; but isn't it easier to normalize after, since we want the activations to have variance 1?
We can add batch normalization to our model in the same way as adding a Dense layer. BatchNormalization() normalizes the activations of the previous layer over each batch and, by default, uses a momentum of 0.99.
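As a minimal sketch (assuming tf.keras; the layer size and input shape are placeholders), it is added like any other layer, and the arguments below just restate those defaults:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, BatchNormalization, Activation

model = Sequential()
model.add(Dense(64, input_shape=(20,)))  # placeholder layer size and input shape
model.add(BatchNormalization(momentum=0.99, epsilon=1e-3))  # tf.keras defaults
model.add(Activation('relu'))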
There is a big problem that appears when you mix these layers, especially when BatchNormalization is right after Dropout.
Dropout tries to keep the same mean of the outputs as without dropout, but it does change the standard deviation, which will cause a huge difference in the BatchNormalization between training and validation. (During training, the BatchNormalization receives the changed standard deviations, accumulates them and stores them. During validation, the dropouts are turned off, so the standard deviation is not a changed one anymore, but the original one. But BatchNormalization, because it's in validation, will not use the batch statistics, but the stored statistics, which will be very different from the batch statistics.)
So, the first and most important rule is: don't place a BatchNormalization after a Dropout (or a SpatialDropout).
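To make the rule concrete, here is a minimal sketch (filter counts, rates and the input shape are placeholders) of the ordering to avoid versus a safer one:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, BatchNormalization, Activation, MaxPooling2D, Dropout

# Ordering to avoid: the BatchNormalization sees dropout-distorted statistics
# during training but undistorted activations at inference time.
avoid = Sequential([
    Conv2D(32, (3, 3), input_shape=(32, 32, 3)),
    Activation('relu'),
    Dropout(0.2),
    BatchNormalization(),  # right after Dropout: train/inference statistics mismatch
])

# Safer: normalize before the activation and keep the dropout after the pooling,
# with no BatchNormalization immediately following it.
safer = Sequential([
    Conv2D(32, (3, 3), input_shape=(32, 32, 3)),
    BatchNormalization(),
    Activation('relu'),
    MaxPooling2D((2, 2)),
    Dropout(0.2),
])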
Usually, I try to leave at least two convolutional/dense layers without any dropout before applying a batch normalization, to avoid this.
Also important: the role of the Dropout is to "zero" the influence of some of the weights of the next layer. If you apply a normalization after the dropout, you will not have "zeros" anymore, but a certain value that will be repeated for many units. And this value will vary from batch to batch. So, although there is noise added, you are not killing units as a pure dropout is supposed to do.
The problem of using a regular Dropout before a MaxPooling is that you will zero some pixels, and then the MaxPooling will take the maximum value, sort of ignoring part of your dropout. If your dropout happens to hit a maximum pixel, then the pooling will result in the second maximum, not in zero.
So, Dropout before MaxPooling reduces the effectiveness of the dropout.
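A tiny numerical example of that, for a single 2x2 pooling window (dropout's rescaling of the surviving pixels is ignored here for simplicity):
import numpy as np

window = np.array([[0.9, 0.7],
                   [0.3, 0.1]])

# Regular dropout happens to zero the maximum pixel...
dropped = window.copy()
dropped[0, 0] = 0.0

# ...but max pooling simply picks the next-largest value instead of a zero.
print(window.max())   # 0.9
print(dropped.max())  # 0.7 -> the dropout is largely ignored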
But a SpatialDropout never hits "pixels"; it only hits channels. When it hits a channel, it will zero all pixels for that channel, and thus the MaxPooling will effectively result in zero too.
So, there is no difference between spatial dropout before or after the pooling. An entire "channel" will be zero in both orders.
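A quick sketch showing this with tf.keras (the feature-map size and rate are placeholders; training=True forces the dropout to be active outside of fit()):
import tensorflow as tf

x = tf.random.uniform((1, 4, 4, 3))  # one 4x4 feature map with 3 channels

# SpatialDropout2D drops entire channels during training.
dropped = tf.keras.layers.SpatialDropout2D(rate=0.5)(x, training=True)
pooled = tf.keras.layers.MaxPooling2D((2, 2))(dropped)

# Any dropped channel is all-zero both before and after pooling, so pooling
# cannot "undo" a spatial dropout the way it can with a regular dropout.
print(tf.reduce_max(tf.abs(dropped), axis=[1, 2]))  # zeros mark dropped channels
print(tf.reduce_max(tf.abs(pooled), axis=[1, 2]))   # the same channels stay zero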
Depending on the activation function, using a batch normalization before it can be a good advantage.
For a 'relu' activation, the normalization makes the model fail-safe against the bad-luck case of "all zeros freeze a relu layer". It will also tend to guarantee that half of the units will be zero and the other half linear.
For a 'sigmoid' or a 'tanh', the BatchNormalization will guarantee that the values are within a healthy range, avoiding saturation and vanishing gradients (values that are too far from zero will hit an almost flat region of these functions, causing vanishing gradients).
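A rough numerical illustration of the saturation point, using the fact that the tanh gradient is 1 - tanh(x)^2 (the scale of 10 is just an exaggerated example of badly scaled pre-activations):
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=10.0, size=1000)  # badly scaled pre-activations

def tanh_grad(v):
    return 1.0 - np.tanh(v) ** 2

normed = (x - x.mean()) / x.std()  # roughly what BatchNormalization does

print(tanh_grad(x).mean())       # close to 0: most units sit in the flat regions
print(tanh_grad(normed).mean())  # much larger: gradients can still flow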
There are people who say there are other advantages if you do the contrary; I'm not fully aware of those advantages, but I like the ones I mentioned very much.
With 'relu', there is no difference; it can be proved that the results are exactly the same.
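A quick check of that claim, applying the same inverted-dropout mask before and after the relu:
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=(4, 6))
mask = (rng.random(x.shape) > 0.3) / 0.7  # inverted-dropout mask with its rescaling

relu = lambda v: np.maximum(v, 0.0)

# Zeroing and positive rescaling commute with max(., 0), so dropping before or
# after the relu gives exactly the same result.
print(np.allclose(relu(x * mask), relu(x) * mask))  # True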
With activations that are not centered, such as 'sigmoid', putting a dropout before the activation will not result in "zeros", but in other values. For a sigmoid, the final result of a dropout placed before it would be 0.5.
If you add a 'tanh' after a dropout, for instance, you will have the zeros, but the scaling that dropout applies to keep the same mean will be distorted by the tanh. (I don't know if this is a big problem, but it might be.)
I don't see much of a difference here: if the activation is not very weird, the final result would be the same.
There are possibilities, but some are troublesome. I find the following order a good one and often use it; I would do something like the sketch below.
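A minimal sketch of such an ordering (the filter counts, rates and input shape are placeholders; the key points are BatchNormalization before the activation, the spatial dropout after the pooling, and no BatchNormalization immediately after a dropout):
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Conv2D, BatchNormalization, Activation,
                                     MaxPooling2D, SpatialDropout2D, Dropout,
                                     Flatten, Dense)

model = Sequential()

# Block 1: BatchNorm before the activation, spatial dropout after the pooling.
model.add(Conv2D(32, (3, 3), input_shape=(32, 32, 3)))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(SpatialDropout2D(0.2))

# Block 2: no BatchNormalization right after the previous dropout.
model.add(Conv2D(64, (3, 3)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(SpatialDropout2D(0.2))

# Classifier head: again, no BatchNormalization directly after the dropout.
model.add(Flatten())
model.add(Dense(128))
model.add(Activation('relu'))
model.add(Dropout(0.4))
model.add(Dense(10))
model.add(Activation('softmax'))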