Suppose that we have an LSTM model for time series forecasting. Also, this is a multivariate case, so we're using more than one feature for training the model.
from keras.layers import Input, Dropout, Dense, CuDNNLSTM

ipt = Input(shape = (shape[0], shape[1]))     ## shape = (timesteps, features)
x = Dropout(0.3)(ipt)                         ## Dropout before LSTM.
x = CuDNNLSTM(10, return_sequences = False)(x)
out = Dense(1, activation='relu')(x)
We can add a Dropout layer before the LSTM (like the code above) or after the LSTM.
If we add it before LSTM, is it applying dropout on timesteps (different lags of time series), or different input features, or both of them?
If we add it after the LSTM, and since return_sequences is False, what is dropout doing here?
Is there any difference between the dropout option of the LSTM and a Dropout layer before the LSTM layer?
Usually, dropout is placed on the fully connected layers only, because they are the ones with the greater number of parameters and are thus more likely to co-adapt excessively, causing overfitting. However, since it is a stochastic regularization technique, you can really place it everywhere.
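For example, a minimal sketch of that usual placement with the Keras functional API (the layer sizes here are arbitrary, purely for illustration):

from keras.layers import Input, Dense, Dropout
from keras.models import Model

ipt = Input(shape=(100,))
x = Dense(64, activation='relu')(ipt)
x = Dropout(0.5)(x)                       # drop half of the dense activations during training
out = Dense(1, activation='sigmoid')(x)
model = Model(ipt, out)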
Use with smaller datasets: like other regularization methods, dropout is more effective on problems where there is a limited amount of training data and the model is likely to overfit the training data. Problems where there is a large amount of training data may see less benefit from using dropout.
Dropout is a regularization method where input and recurrent connections to LSTM units are probabilistically excluded from activation and weight updates while training a network. This has the effect of reducing overfitting and improving model performance.
We should not use a plain dropout layer right after a convolutional layer: as we slide the filter over the width and height of the input image, we produce a 2-dimensional activation map that gives the responses of that filter at every spatial position. Neighbouring activations in such a map are strongly correlated, so zeroing individual activations does little to regularize; dropping entire feature maps (spatial dropout) is more effective there.
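A sketch of that idea after a convolution (hypothetical layer sizes, channels-last input), using the spatial variant discussed further below:

from keras.layers import Input, Conv2D, SpatialDropout2D
from keras.models import Model

ipt = Input(shape=(32, 32, 3))
x = Conv2D(16, (3, 3), activation='relu')(ipt)
x = SpatialDropout2D(0.3)(x)              # zeroes whole feature maps rather than single activations
out = Conv2D(8, (3, 3), activation='relu')(x)
model = Model(ipt, out)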
By default, Dropout creates a random tensor of zeros and ones. There is no pattern and no privileged axis, so you can't say a specific thing is being dropped, just random coordinates in the tensor. (Well, it drops features, but different features for each step, and differently for each sample.)
You can, if you want, use the noise_shape argument, which defines the shape of the random tensor. Then you can choose whether to drop steps, features, or samples, or maybe a combination:
noise_shape = (1, steps, 1)      # drop whole time steps
noise_shape = (1, 1, features)   # drop whole features
noise_shape = (None, 1, 1)       # drop whole samples
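As a sketch of how that looks in the layer call (the timestep and feature counts below are made up; the pattern otherwise mirrors the question's model):

from keras.layers import Input, Dropout, Dense, CuDNNLSTM
from keras.models import Model

steps, features = 20, 4                                 # hypothetical (timesteps, features)
ipt = Input(shape=(steps, features))
x = Dropout(0.3, noise_shape=(1, 1, features))(ipt)     # whole features dropped, same mask for every step
x = CuDNNLSTM(10, return_sequences=False)(x)
out = Dense(1, activation='relu')(x)
model = Model(ipt, out)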
There is also the SpatialDropout1D layer, which uses noise_shape = (input_shape[0], 1, input_shape[2]) automatically. This drops the same features for all time steps, but treats each sample individually (each sample will drop a different group of features).
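In the question's model it would simply replace the Dropout layer (a sketch, keeping the question's shape variable and its older CuDNNLSTM layer):

from keras.layers import Input, SpatialDropout1D, CuDNNLSTM, Dense
from keras.models import Model

ipt = Input(shape=(shape[0], shape[1]))       # (timesteps, features)
x = SpatialDropout1D(0.3)(ipt)                # same features dropped at all steps, per sample
x = CuDNNLSTM(10, return_sequences=False)(x)
out = Dense(1, activation='relu')(x)
model = Model(ipt, out)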
After the LSTM you have shape = (None, 10). So you use Dropout the same way you would in any fully connected network: it drops a different group of features for each sample.
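That case is just the ordinary pattern (sketch):

from keras.layers import Input, Dropout, Dense, CuDNNLSTM
from keras.models import Model

ipt = Input(shape=(shape[0], shape[1]))
x = CuDNNLSTM(10, return_sequences=False)(ipt)   # output shape: (None, 10)
x = Dropout(0.3)(x)                              # plain dropout on the 10 LSTM features
out = Dense(1, activation='relu')(x)
model = Model(ipt, out)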
Passing dropout as an argument to the LSTM is quite different. It generates 4 dropout masks, one for the input of each of the gates (you can check the LSTMCell code to see this). There is also the recurrent_dropout option, which likewise generates 4 dropout masks, but applies them to the recurrent states instead of the inputs, at each step of the recurrent calculations.
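A sketch of those two arguments (note they belong to the plain LSTM layer; CuDNNLSTM does not support dropout or recurrent_dropout):

from keras.layers import Input, LSTM, Dense
from keras.models import Model

ipt = Input(shape=(shape[0], shape[1]))
x = LSTM(10,
         dropout=0.3,             # masks the inputs to the 4 gates at each time step
         recurrent_dropout=0.3,   # masks the recurrent state at each time step
         return_sequences=False)(ipt)
out = Dense(1, activation='relu')(x)
model = Model(ipt, out)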
You are confusing Dropout with its variant SpatialDropoutND (either 1D, 2D or 3D). See the documentation (apparently you can't link to a specific class).
Dropout applies a random binary mask to the input, no matter the shape, except for the first dimension (batch), so it applies to both features and timesteps in this case.
Here, since return_sequences=False, you only get the output from the last timestep, so it would be of size [batch, 10] in your case. Dropout will randomly drop values from the second dimension.
Yes, there is a difference: the dropout argument of the LSTM is applied per time step while the sequence is processed (e.g. a sequence of length 10 goes through the unrolled LSTM and some of the input features are dropped before each cell update). A Dropout layer drops random elements (except along the batch dimension), whereas SpatialDropout1D drops entire channels; in this case some feature channels would be dropped for all timesteps (in the convolution case, you could use SpatialDropout2D to drop channels, either at the input or along the network).
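To see the contrast directly, here is a small sketch run in training mode (it assumes TensorFlow 2 with its bundled Keras, since eager execution makes the masks easy to print):

import numpy as np
import tensorflow as tf

x = np.ones((1, 5, 4), dtype='float32')                        # (batch, steps, features)
d = tf.keras.layers.Dropout(0.5)(x, training=True)             # zeros scattered over both steps and features
s = tf.keras.layers.SpatialDropout1D(0.5)(x, training=True)    # whole feature columns zeroed for all steps
print(d.numpy()[0])
print(s.numpy()[0])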