How to understand SpatialDropout1D and when to use it?

Tags:

machine-learning

deep-learning

keras

conv-neural-network

dropout

Occasionally I see models that use SpatialDropout1D instead of Dropout. For example, a part-of-speech tagging neural network uses:

model = Sequential()
model.add(Embedding(s_vocabsize, EMBED_SIZE, input_length=MAX_SEQLEN))
model.add(SpatialDropout1D(0.2))  ## This
model.add(GRU(HIDDEN_SIZE, dropout=0.2, recurrent_dropout=0.2))
model.add(RepeatVector(MAX_SEQLEN))
model.add(GRU(HIDDEN_SIZE, return_sequences=True))
model.add(TimeDistributed(Dense(t_vocabsize)))
model.add(Activation("softmax"))

The Keras documentation says:

This version performs the same function as Dropout, however it drops entire 1D feature maps instead of individual elements.

However, I am unable to understand what "entire 1D feature maps" means. More specifically, I am unable to visualize SpatialDropout1D in the same model explained on Quora. Can someone explain this concept using the same model as on Quora?

Also, in what situations would we use SpatialDropout1D instead of Dropout?

asked May 17 '18 14:05

Raven Cheuk

2 Answers

To keep it simple, I would first note that the so-called feature maps (1D, 2D, etc.) are just our regular channels. Let's look at examples:

Dropout(): Let's define a 2D input: [[1, 1, 1], [2, 2, 2]]. Dropout considers every element independently, and may produce something like [[1, 0, 1], [0, 2, 2]].
SpatialDropout1D(): In this case the result will look like [[1, 0, 1], [2, 0, 2]]. Treating each row as a timestep and each column as a channel, notice that the 2nd channel was zeroed at every timestep.
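
The two cases above can be reproduced with a small plain-Python sketch. It is only an illustration, not the Keras implementation: the helper names `elementwise_dropout` and `channel_dropout` are made up here, rows are assumed to be timesteps and columns channels, and the rescaling of kept values by 1 / (1 - rate) that Keras applies during training is left out so the numbers match the example.

```python
import random

def elementwise_dropout(x, rate, rng):
    """Zero each element independently with probability `rate` (Dropout-like)."""
    return [[0 if rng.random() < rate else v for v in row] for row in x]

def channel_dropout(x, rate, rng):
    """Zero whole columns (channels) with probability `rate` (SpatialDropout1D-like)."""
    n_channels = len(x[0])
    # One coin flip per channel, reused for every row (timestep).
    keep = [rng.random() >= rate for _ in range(n_channels)]
    return [[v if keep[c] else 0 for c, v in enumerate(row)] for row in x]

rng = random.Random(0)
x = [[1, 1, 1], [2, 2, 2]]  # 2 timesteps, 3 channels (assumed layout)

print(elementwise_dropout(x, 0.5, rng))  # zeros can appear anywhere
print(channel_dropout(x, 0.5, rng))      # each column is all-zero or untouched
```

With the channel-wise version, a column never ends up half-dropped, which is the property the second example illustrates.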

answered Sep 29 '22 22:09

Dilshat

The noise shape

To understand SpatialDropout1D, you should get used to the notion of the noise shape. In plain vanilla dropout, each element is kept or dropped independently. For example, if the tensor has shape [2, 2, 2], each of its 8 elements can be zeroed out depending on a random coin flip (with a certain "heads" probability); in total, there are 8 independent coin flips, and any number of values may become zero, from 0 to 8.

Sometimes there is a need to do more than that. For example, one may need to drop a whole slice along axis 0. The noise_shape in this case is [1, 2, 2], and the dropout involves only 4 independent coin flips. The two elements that share a position along axes 1 and 2 are either kept together or dropped together, so the number of zeroed elements can be 0, 2, 4, 6 or 8. It cannot be 1 or 5.

Another way to view this is to imagine that the input tensor in fact has shape [2, 2], but each value is double-precision (or multi-precision). Instead of dropping bytes in the middle, the layer drops the full multi-byte value.
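
The coin-flip counting can be checked with a short plain-Python sketch. The helper `broadcast_mask` is a hypothetical name introduced for this illustration; it mimics how a noise shape is broadcast over the tensor shape and is not how Keras or TensorFlow implement it internally.

```python
import itertools
import random

def broadcast_mask(shape, noise_shape, rate, rng):
    """Build a 0/1 keep-mask of `shape` from coin flips of `noise_shape`.

    Axes where noise_shape is 1 share a single flip, as in broadcasting.
    """
    flips = {
        idx: int(rng.random() >= rate)
        for idx in itertools.product(*(range(n) for n in noise_shape))
    }
    mask = {}
    for idx in itertools.product(*(range(n) for n in shape)):
        key = tuple(0 if n == 1 else i for i, n in zip(idx, noise_shape))
        mask[idx] = flips[key]
    return mask

rng = random.Random(42)
full = broadcast_mask([2, 2, 2], [2, 2, 2], 0.5, rng)    # 8 independent flips
shared = broadcast_mask([2, 2, 2], [1, 2, 2], 0.5, rng)  # 4 flips shared along axis 0

zeros_full = sum(1 for keep in full.values() if keep == 0)
zeros_shared = sum(1 for keep in shared.values() if keep == 0)
print(zeros_full)    # any count from 0 to 8
print(zeros_shared)  # always even: 0, 2, 4, 6 or 8
```

The mask counts dropped positions; the number of zeroed values in the output matches it as long as the input itself contains no zeros.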

Why is it useful?

The example above is just for illustration and isn't common in real applications. A more realistic example is this: shape(x) = [k, l, m, n] and noise_shape = [k, 1, 1, n]. In this case, each batch and channel component is kept or dropped independently, but all rows and columns within it are kept or dropped together. In other words, the whole [l, m] feature map is either kept or dropped.

You may want to do this to account for the correlation between adjacent pixels, especially in the early convolutional layers. Effectively, you want to prevent co-adaptation of pixels with their neighbors across the feature maps, and make them learn as if no other feature maps exist. This is exactly what SpatialDropout2D does: it promotes independence between feature maps.
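
As a rough sketch of the [k, 1, 1, n] noise shape, the snippet below builds a keep-mask for a channels-last tensor in plain Python. The function name `spatial_dropout_2d_mask` and the toy sizes are assumptions made for illustration; this is not the SpatialDropout2D source code.

```python
import random

def spatial_dropout_2d_mask(k, l, m, n, rate, rng):
    """Keep-mask of shape [k, l, m, n] built from noise_shape [k, 1, 1, n]."""
    # One coin flip per (sample, channel) pair.
    keep = [[int(rng.random() >= rate) for _ in range(n)] for _ in range(k)]
    # Copy that flip to every (row, column) position of the feature map.
    return [[[[keep[b][c] for c in range(n)]
              for _ in range(m)]
             for _ in range(l)]
            for b in range(k)]

rng = random.Random(7)
mask = spatial_dropout_2d_mask(k=2, l=3, m=3, n=4, rate=0.5, rng=rng)

# For sample 0, print the [l, m] map of each channel: all ones or all zeros.
for c in range(4):
    feature_map = [[mask[0][i][j][c] for j in range(3)] for i in range(3)]
    print(c, feature_map)
```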

SpatialDropout1D is very similar: given shape(x) = [k, l, m], it uses noise_shape = [k, 1, m] and drops entire 1-D feature maps.
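
For the model in the question, the input to SpatialDropout1D is the Embedding output, with k the batch size, l the sequence length (MAX_SEQLEN) and m the embedding size (EMBED_SIZE). The following plain-Python sketch shows what the [k, 1, m] noise shape means there. The function `spatial_dropout_1d` and the toy sizes are assumptions for illustration, not the Keras code; it does include the 1 / (1 - rate) rescaling of kept values that Keras dropout layers apply during training.

```python
import random

def spatial_dropout_1d(x, rate, rng):
    """Apply noise_shape [k, 1, m] dropout to x of shape [k, l, m]."""
    scale = 1.0 / (1.0 - rate)  # rescale kept values (requires rate < 1)
    out = []
    for sample in x:  # each sample in the batch gets its own mask
        m = len(sample[0])
        keep = [rng.random() >= rate for _ in range(m)]  # one flip per channel
        out.append([[v * scale if keep[c] else 0.0 for c, v in enumerate(step)]
                    for step in sample])
    return out

# Toy stand-in for the Embedding output: 2 sentences, 4 timesteps,
# 5 embedding dimensions, all values set to 1.0 for readability.
rng = random.Random(3)
x = [[[1.0] * 5 for _ in range(4)] for _ in range(2)]
y = spatial_dropout_1d(x, rate=0.2, rng=rng)

for sentence in y:
    for step in sentence:
        print(step)  # the same embedding dimensions are zero at every timestep
    print()
```

In other words, for a given sentence a dropped embedding dimension is missing for every word, whereas plain Dropout would zero different dimensions for different words.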

Reference: Efficient Object Localization Using Convolutional Networks by Jonathan Tompson et al.

answered Sep 30 '22 00:09

Maxim
