I would like to apply layer normalization to a recurrent neural network using tf.keras. In TensorFlow 2.0 there is a LayerNormalization class in tf.keras.layers, but it's unclear how to use it within a recurrent layer such as LSTM at each time step (which is how it was designed to be used). Should I create a custom cell, or is there a simpler way?
For example, applying dropout at each time step is as easy as setting the recurrent_dropout argument when creating an LSTM layer, but there is no recurrent_layer_normalization argument.
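For context, that existing option is a single keyword argument; the values below are purely illustrative:

import tensorflow as tf

# Dropout on the recurrent state at each time step is one argument...
lstm = tf.keras.layers.LSTM(32, recurrent_dropout=0.2, return_sequences=True)
# ...but there is no analogous argument for per-step layer normalization.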
A note on terminology, since several similarly named things are easy to confuse. Layer normalization (LN) normalizes the activations along the feature direction instead of the mini-batch direction; this removes batch normalization's dependency on the batch and makes the technique much easier to apply to RNNs. It is different from the Normalization preprocessing layer, which shifts and scales continuous input features into a distribution centered around 0 with standard deviation 1 by precomputing a mean and variance over the data (either during adapt() or passed in directly) and computing (input - mean) / sqrt(var) at runtime; during adapt(), that layer computes a separate mean and variance for each position along the specified axis (pass axis=None to get a single mean and variance over all the input data). It is also different from plain min-max normalization, which simply rescales data from its original range so that all values fall between 0 and 1.
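As a quick, standalone illustration of that per-sample behavior (not part of the answer below): each row of the input is normalized across its own features, with no statistics computed over the batch or a dataset.

import tensorflow as tf

# Two samples with three features each; with the default gamma=1 and beta=0,
# every row comes out with roughly zero mean and unit variance.
x = tf.constant([[1.0, 2.0, 3.0],
                 [10.0, 20.0, 30.0]])
layer_norm = tf.keras.layers.LayerNormalization(axis=-1)
print(layer_norm(x).numpy())
# Both rows become approximately [-1.22, 0., 1.22]: each sample is normalized on its own.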
You can create a custom cell by inheriting from the SimpleRNNCell class, like this:
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.activations import get as get_activation
from tensorflow.keras.layers import LayerNormalization, RNN, SimpleRNNCell

class SimpleRNNCellWithLayerNorm(SimpleRNNCell):
    def __init__(self, units, **kwargs):
        # Build the parent cell *without* an activation, and keep the requested
        # activation under a separate name so the parent doesn't overwrite it.
        activation = kwargs.pop("activation", "tanh")
        super().__init__(units, activation=None, **kwargs)
        self.output_activation = get_activation(activation)
        self.layer_norm = LayerNormalization()

    def call(self, inputs, states):
        # One linear SimpleRNN step, then layer norm, then the activation.
        outputs, new_states = super().call(inputs, states)
        norm_out = self.output_activation(self.layer_norm(outputs))
        return norm_out, [norm_out]
This implementation runs a regular SimpleRNN cell for one step without any activation, then applies layer normalization to the resulting output, and finally applies the activation. You can then use it like this:
model = Sequential([
    RNN(SimpleRNNCellWithLayerNorm(20), return_sequences=True,
        input_shape=[None, 20]),
    RNN(SimpleRNNCellWithLayerNorm(5)),
])
model.compile(loss="mse", optimizer="sgd")

# Dummy data: 100 sequences of 50 time steps with 20 features each.
X_train = np.random.randn(100, 50, 20)
Y_train = np.random.randn(100, 5)
history = model.fit(X_train, Y_train, epochs=2)
For GRU and LSTM cells, people generally apply layer norm on the gates (after the linear combination of the inputs and states, and before the sigmoid activation), so it's a bit trickier to implement (a rough sketch follows below). Alternatively, you can probably get good results by just applying layer norm before applying activation and recurrent_activation, which would be easier to implement.
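To make the gate-normalized variant concrete, here is a minimal sketch of an LSTM-style cell with layer norm applied to the gate pre-activations. This is not the answer's code or a library class: the class name LayerNormLSTMCellSketch, the choice of one LayerNormalization per gate, and the extra normalization of the cell state are illustrative assumptions; real implementations place the normalization in slightly different spots.

import tensorflow as tf
from tensorflow.keras.layers import Layer, LayerNormalization

class LayerNormLSTMCellSketch(Layer):  # hypothetical name, for illustration only
    def __init__(self, units, **kwargs):
        super().__init__(**kwargs)
        self.units = units
        self.state_size = [units, units]   # [h, c]
        self.output_size = units
        # One LayerNormalization per gate pre-activation (i, f, g, o),
        # plus one for the cell state (an optional, common variant).
        self.gate_norms = [LayerNormalization() for _ in range(4)]
        self.cell_norm = LayerNormalization()

    def build(self, input_shape):
        input_dim = input_shape[-1]
        # All four gates are computed from one fused linear transformation.
        self.kernel = self.add_weight(
            name="kernel", shape=(input_dim, 4 * self.units),
            initializer="glorot_uniform")
        self.recurrent_kernel = self.add_weight(
            name="recurrent_kernel", shape=(self.units, 4 * self.units),
            initializer="orthogonal")
        self.bias = self.add_weight(
            name="bias", shape=(4 * self.units,), initializer="zeros")

    def call(self, inputs, states):
        h_prev, c_prev = states
        # Linear combination of the inputs and the previous hidden state.
        z = (tf.matmul(inputs, self.kernel)
             + tf.matmul(h_prev, self.recurrent_kernel) + self.bias)
        # Layer norm on each gate pre-activation, before the nonlinearities.
        zi, zf, zg, zo = tf.split(z, 4, axis=-1)
        i = tf.sigmoid(self.gate_norms[0](zi))
        f = tf.sigmoid(self.gate_norms[1](zf))
        g = tf.tanh(self.gate_norms[2](zg))
        o = tf.sigmoid(self.gate_norms[3](zo))
        c = f * c_prev + i * g
        h = o * tf.tanh(self.cell_norm(c))
        return h, [h, c]

You would then use it the same way as the SimpleRNN cell above, e.g. tf.keras.layers.RNN(LayerNormLSTMCellSketch(20), return_sequences=True).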