
Dropout layer before or after LSTM. What is the difference?

Suppose we have an LSTM model for time series forecasting. It is a multivariate case, so we use more than one feature to train the model.

from keras.layers import Input, Dropout, CuDNNLSTM, Dense

ipt   = Input(shape = (shape[0], shape[1]))  # (timesteps, features)
x     = Dropout(0.3)(ipt)                    ## Dropout before LSTM.
x     = CuDNNLSTM(10, return_sequences = False)(x)
out   = Dense(1, activation='relu')(x)

We can add a Dropout layer before the LSTM (as in the code above) or after it.

  • If we add it before the LSTM, does it apply dropout to the timesteps (different lags of the time series), to the different input features, or to both?

  • If we add it after the LSTM, given that return_sequences is False, what is dropout doing there?

  • Is there any difference between the dropout option of the LSTM and a Dropout layer before the LSTM layer?

asked Nov 07 '19 by Eghbal



2 Answers

By default, Dropout creates a random tensor of zeros and ones with no pattern and no privileged axis. So you can't say a specific thing is being dropped, just random coordinates in the tensor. (Well, it drops features, but a different set of features for each step and for each sample.)

You can, if you want, use the noise_shape argument, which defines the shape of the random tensor. Then you can choose whether to drop steps, features, or samples, or maybe a combination (see the sketch after this list):

  • Dropping time steps: noise_shape = (1,steps,1)
  • Dropping features: noise_shape = (1,1, features)
  • Dropping samples: noise_shape = (None, 1, 1)
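For instance, a minimal sketch of those three variants (hypothetical sizes, assuming TF2's tf.keras functional API):

from tensorflow.keras.layers import Input, Dropout

steps, features = 7, 3  # hypothetical sizes

ipt = Input(shape = (steps, features))  # tensors are (batch, steps, features)
drop_steps    = Dropout(0.3, noise_shape = (1, steps, 1))(ipt)       # zeros whole timesteps
drop_features = Dropout(0.3, noise_shape = (1, 1, features))(ipt)    # zeros whole features
drop_samples  = Dropout(0.3, noise_shape = (None, 1, 1))(ipt)        # zeros whole samples (None = batch size)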

There is also the SpatialDropout1D layer, which uses noise_shape = (input_shape[0], 1, input_shape[2]) automatically. This drops the same feature for all time steps, but treats each sample individually (each sample will drop a different group of features).
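Reusing ipt and features from the sketch above, that equivalence can be written out explicitly (a hedged sketch, not the layer's actual implementation):

from tensorflow.keras.layers import Dropout, SpatialDropout1D

sd = SpatialDropout1D(0.3)(ipt)                            # drops whole features, per sample
eq = Dropout(0.3, noise_shape = (None, 1, features))(ipt)  # same masking pattern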

After the LSTM you have shape = (None, 10). So you use Dropout the same way you would in any fully connected network: it drops a different group of features for each sample.
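In the question's model that looks like the following sketch (reusing the imports from the question's code):

x   = CuDNNLSTM(10, return_sequences = False)(ipt)  # output shape: (batch, 10)
x   = Dropout(0.3)(x)                               # plain dropout on the 10 units, per sample
out = Dense(1, activation='relu')(x)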

A dropout passed as an argument to the LSTM is quite different. It generates 4 dropout masks, one for the inputs to each of the four gates. (You can check the LSTMCell source code to confirm this.)

There is also the recurrent_dropout option, which likewise generates 4 dropout masks, but applies them to the states instead of the inputs, at each step of the recurrent calculation.
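A hedged sketch of those two arguments; note that CuDNNLSTM exposes neither of them, so the plain LSTM layer is used here:

from tensorflow.keras.layers import LSTM

# dropout masks the gate inputs; recurrent_dropout masks the recurrent
# state at every step (which also disqualifies the fast cuDNN kernel).
x = LSTM(10, dropout = 0.2, recurrent_dropout = 0.2, return_sequences = False)(ipt)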

answered Sep 22 '22 by Daniel Möller

You are confusing Dropout with its variant SpatialDropoutND (either 1D, 2D or 3D). See the documentation (apparently you can't link to a specific class).

  • Dropout applies a random binary mask to the input, whatever its shape, except for the first (batch) dimension, so in this case it applies to both features and timesteps.

  • Here, since return_sequences=False, you only get the output of the last timestep, which is of size [batch, 10] in your case. Dropout will then randomly drop values along the second dimension.

  • Yes, there is a difference. The LSTM's dropout argument applies to the timesteps as the LSTM processes the sequence (e.g. a sequence of length 10 goes through the unrolled LSTM and some of the input features are dropped before entering each cell). A Dropout layer drops random elements (along every dimension except the batch one). SpatialDropout1D drops entire channels; here the channels are the features, so a dropped feature is zeroed across all timesteps (in the convolutional case, you could use SpatialDropout2D to drop channels, either at the input or along the network). See the toy comparison below.
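A quick toy comparison of the two layers (made-up sizes, assuming TF2's eager tf.keras):

import tensorflow as tf

x = tf.ones((1, 4, 3))  # (batch, timesteps, features)
print(tf.keras.layers.Dropout(0.5)(x, training = True))           # scattered zeros
print(tf.keras.layers.SpatialDropout1D(0.5)(x, training = True))  # whole feature columns zeroed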

answered Sep 23 '22 by Szymon Maszke