Suppose that I have a model like this (this is a model for time series forecasting):
ipt = Input((data.shape[1] ,data.shape[2])) # 1
x = Conv1D(filters = 10, kernel_size = 3, padding = 'causal', activation = 'relu')(ipt) # 2
x = LSTM(15, return_sequences = False)(x) # 3
x = BatchNormalization()(x) # 4
out = Dense(1, activation = 'relu')(x) # 5
Now I want to add a batch normalization layer to this network. Considering the fact that batch normalization doesn't work with LSTM, can I add it before the Conv1D layer? I think it's rational to have a batch normalization layer after the LSTM.
Also, where can I add Dropout in this network? In the same places (after or before batch normalization)?
What about adding AveragePooling1D between Conv1D and LSTM? Is it possible to add batch normalization between Conv1D and AveragePooling1D in this case without any effect on the LSTM layer?
A batch normalization layer normalizes a mini-batch of data across all observations for each channel independently. To speed up training of a convolutional neural network and reduce the sensitivity to network initialization, use batch normalization layers between convolutional layers and nonlinearities, such as ReLU layers.
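As a minimal illustration of that placement (convolution, then batch normalization, then the nonlinearity), here is a sketch in Keras; the input shape (100, 20) and filter count are arbitrary placeholders, not taken from the question:
from keras.layers import Input, Conv1D, BatchNormalization, Activation
from keras.models import Model

ipt = Input((100, 20))                                      # (timesteps, channels) - example shape
x   = Conv1D(10, 3, padding='causal', use_bias=False)(ipt)  # bias is redundant when BN follows
x   = BatchNormalization()(x)                               # normalize the pre-activation outputs
x   = Activation('relu')(x)                                 # nonlinearity applied after BN
model = Model(ipt, x)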
Layer Normalization (LN): inspired by the results of Batch Normalization, Geoffrey Hinton et al. proposed Layer Normalization, which normalizes the activations along the feature direction instead of the mini-batch direction.
Other techniques similar to Batch Normalization that take these limitations into account have been developed, for example Layer Normalization. There are also reparametrizations of the LSTM layer that allow Batch Normalization to be used, for example as described in Recurrent Batch Normalization by Cooijmans et al. 2016.
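For an inter-layer (not recurrent) use of Layer Normalization with an LSTM, a minimal sketch follows; it assumes tf.keras from TF 2.x, which ships LayerNormalization (older standalone Keras may not), and the layer sizes are placeholders:
from tensorflow.keras.layers import Input, LSTM, LayerNormalization, Dense
from tensorflow.keras.models import Model

ipt = Input((100, 20))                      # (timesteps, channels) - example shape
x   = LSTM(16, return_sequences=True)(ipt)  # keep the full sequence for the next layer
x   = LayerNormalization()(x)               # normalizes across features, per timestep
x   = LSTM(16, return_sequences=False)(x)
out = Dense(1, activation='relu')(x)
model = Model(ipt, out)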
Update: the LayerNormalization implementation I was using was inter-layer, not recurrent as in the original paper; results with the latter may prove superior.
BatchNormalization can work with LSTMs - the linked SO gives false advice; in fact, in my application of EEG classification, it dominated LayerNormalization. Now to your case:
- "Can I add it before Conv1D"? Don't - instead, standardize your data beforehand, else you're employing an inferior variant to do the same thing (a sketch follows this list)
- Try BatchNormalization both before an activation and after - apply to both Conv1D and LSTM
- BN after LSTM may be counterproductive per its ability to introduce noise, which can confuse the classifier layer - but this is about being one layer before the output, not about LSTM
- Unless you are stacking LSTMs with return_sequences=True preceding return_sequences=False, you can place Dropout anywhere - before the LSTM, after it, or both
- recurrent_dropout is still preferable to Dropout for LSTM - however, you can do both; just do not use it with activation='relu', for which LSTM is unstable per a bug
- Pooling is redundant here and may harm performance; scarce data is better transformed via a non-linearity than by simple averaging ops
- Consider a SqueezeExcite block after your Conv; it's a form of self-attention - see the paper; my implementation for 1D is below
- Try activation='selu' with AlphaDropout and 'lecun_normal' initialization, per the paper Self-Normalizing Neural Networks
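On the first point ("standardize your data beforehand"), here is a minimal sketch of what that could look like for (samples, timesteps, channels) arrays; the per-channel z-scoring and the helper name are assumptions for illustration, not the answer author's exact preprocessing:
import numpy as np

def standardize(train, test, eps=1e-8):
    # Z-score per channel, using statistics from the training set only
    mean = train.mean(axis=(0, 1), keepdims=True)   # per-channel mean over samples & timesteps
    std  = train.std(axis=(0, 1), keepdims=True)
    return (train - mean) / (std + eps), (test - mean) / (std + eps)

# Usage with (samples, timesteps, channels) arrays:
x_train = np.random.randn(64, 21, 20)
x_test  = np.random.randn(16, 21, 20)
x_train, x_test = standardize(x_train, x_test)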
Below is an example template you can use as a starting point; I also recommend the following SO's for further reading: Regularizing RNNs, and Visualizing RNN gradients
from keras.layers import Input, Dense, LSTM, Conv1D, Activation
from keras.layers import AlphaDropout, BatchNormalization
from keras.layers import GlobalAveragePooling1D, Reshape, multiply
from keras.models import Model
import keras.backend as K
import numpy as np
def make_model(batch_shape):
    ipt = Input(batch_shape=batch_shape)
    x   = ConvBlock(ipt)
    x   = LSTM(16, return_sequences=False, recurrent_dropout=0.2)(x)
    # x = BatchNormalization()(x)  # may or may not work well
    out = Dense(1, activation='relu')(x)

    model = Model(ipt, out)
    model.compile('nadam', 'mse')
    return model

def make_data(batch_shape):  # toy data
    return (np.random.randn(*batch_shape),
            np.random.uniform(0, 2, (batch_shape[0], 1)))

batch_shape = (32, 21, 20)  # (batch_size, timesteps, channels)
model = make_model(batch_shape)
x, y  = make_data(batch_shape)
model.train_on_batch(x, y)
Functions used:
def ConvBlock(_input):  # cleaner code
    x = Conv1D(filters=10, kernel_size=3, padding='causal', use_bias=False,
               kernel_initializer='lecun_normal')(_input)
    x = BatchNormalization(scale=False)(x)
    x = Activation('selu')(x)
    x = AlphaDropout(0.1)(x)
    out = SqueezeExcite(x)
    return out

def SqueezeExcite(_input, r=4):  # r == "reduction factor"; see paper
    filters = K.int_shape(_input)[-1]
    se = GlobalAveragePooling1D()(_input)
    se = Reshape((1, filters))(se)
    se = Dense(filters//r, activation='relu', use_bias=False,
               kernel_initializer='he_normal')(se)
    se = Dense(filters, activation='sigmoid', use_bias=False,
               kernel_initializer='he_normal')(se)
    return multiply([_input, se])
Spatial Dropout: pass noise_shape = (batch_size, 1, channels) to Dropout - the dropout mask is then broadcast over the timestep axis, so entire channels are dropped per sample (a sketch follows); see Git gist for code.
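A minimal sketch of that noise_shape usage (equivalent in effect to SpatialDropout1D); the dimensions here are placeholders, not values from the question:
from keras.layers import Input, Conv1D, Dropout, SpatialDropout1D
from keras.models import Model

batch_size, timesteps, channels = 32, 21, 10   # placeholder dimensions
ipt = Input(batch_shape=(batch_size, timesteps, channels))
x   = Conv1D(channels, 3, padding='causal')(ipt)
# Mask is broadcast over the timestep axis, so whole channels are dropped per sample
x   = Dropout(0.2, noise_shape=(batch_size, 1, channels))(x)
# Equivalent built-in alternative: x = SpatialDropout1D(0.2)(x)
model = Model(ipt, x)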