I know this is a subject with a lot of questions, but I couldn't find a solution to my problem.
I am training an LSTM network on variable-length inputs using a masking layer, but it seems to have no effect.
Input shape is (100, 362, 24), with 362 being the maximum sequence length, 24 the number of features and 100 the number of samples (divided 75 train / 25 valid).
Output shape is (100, 362, 1), transformed later to (100, 362 - N, 1).
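For reference, arrays of this shape can be built by post-padding variable-length samples with zeros; a minimal sketch with toy data (the names and values here are illustrative only, not my actual preprocessing):
import numpy as np

# toy stand-in for variable-length samples, each of shape (len_i, 24)
sequences = [np.random.rand(200, 24), np.random.rand(362, 24), np.random.rand(50, 24)]

max_len, n_features = 362, 24
x = np.zeros((len(sequences), max_len, n_features), dtype='float32')
for i, seq in enumerate(sequences):
    x[i, :len(seq), :] = seq  # real data first, trailing timesteps stay 0 (padding)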
Here is the code for my network:
from keras import Sequential
from keras.layers import Embedding, Masking, LSTM, Lambda
import keras.backend as K
# example for N:3
#       O O O
#       | | |
# O O O O O O
# | | | | | |
# O O O O O O
N = 5
y = y[:, N:, :]
x_train = x[:75]
x_test = x[75:]
y_train = y[:75]
y_test = y[75:]
timesteps, features = 362, 24  # maximum sequence length and number of features per timestep

model = Sequential()
model.add(Masking(mask_value=0., input_shape=(timesteps, features)))
model.add(LSTM(128, return_sequences=True))
model.add(LSTM(64, return_sequences=True))
model.add(LSTM(1, return_sequences=True))
model.add(Lambda(lambda x: x[:, N:, :]))
model.compile('adam', 'mae')
print(model.summary())
history = model.fit(x_train, y_train,
                    epochs=3,
                    batch_size=15,
                    validation_data=[x_test, y_test])
My data is padded at the end. Example:
>> x_test[10,350]
array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0.], dtype=float32)
The problem is that the masking layer seems to have no effect. I can see this from the loss value printed during training, which is equal to the unmasked loss I calculate afterwards:
Layer (type) Output Shape Param #
=================================================================
masking_1 (Masking) (None, 362, 24) 0
_________________________________________________________________
lstm_1 (LSTM) (None, 362, 128) 78336
_________________________________________________________________
lstm_2 (LSTM) (None, 362, 64) 49408
_________________________________________________________________
lstm_3 (LSTM) (None, 362, 1) 264
_________________________________________________________________
lambda_1 (Lambda) (None, 357, 1) 0
=================================================================
Total params: 128,008
Trainable params: 128,008
Non-trainable params: 0
_________________________________________________________________
None
Train on 75 samples, validate on 25 samples
Epoch 1/3
75/75 [==============================] - 8s 113ms/step - loss: 0.1711 - val_loss: 0.1814
Epoch 2/3
75/75 [==============================] - 5s 64ms/step - loss: 0.1591 - val_loss: 0.1307
Epoch 3/3
75/75 [==============================] - 5s 63ms/step - loss: 0.1057 - val_loss: 0.1034
>> from sklearn.metrics import mean_absolute_error
>> out = model.predict(x_test, batch_size=1)
>> print('wo mask', mean_absolute_error(y_test.ravel(), out.ravel()))
>> print('w mask', mean_absolute_error(y_test[~(x_test[:,N:] == 0).all(axis=2)].ravel(), out[~(x_test[:,N:] == 0).all(axis=2)].ravel()))
wo mask 0.10343371
w mask 0.16236152
Furthermore, if I use NaN for the masked output values, I can see the NaN being propagated during training (the loss equals nan).
What am I missing to make the masking layer work as expected?
A Masking layer is meant to "ignore steps" in sequences. Your LSTM is working with sequences of 362 steps and 24 features per step. If all features in a step have the value defined in Masking (0 in your case), that step will be ignored during training. The idea is to simulate variable-length sequences.
Masking is a way to tell sequence-processing layers that certain timesteps in an input are missing, and thus should be skipped when processing the data. Padding is a special form of masking where the masked steps are at the start or the end of a sequence.
mask_zero: Boolean, whether or not the input value 0 is a special "padding" value that should be masked out. This is useful when using recurrent layers which may take variable length input. If this is True, then all subsequent layers in the model need to support masking or an exception will be raised.
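Concretely, with mask_value=0. the mask that Masking computes is True for timesteps where at least one feature is non-zero and False where every feature equals the mask value. A small numpy sketch of that equivalence (toy data, not from the question):
import numpy as np

# toy batch: 1 sample, 4 timesteps, 3 features; the last two timesteps are padding
x = np.array([[[1., 2., 3.],
               [4., 5., 6.],
               [0., 0., 0.],
               [0., 0., 0.]]], dtype='float32')

# equivalent to the mask produced by Masking(mask_value=0.): True = keep, False = skip
mask = ~np.all(x == 0., axis=-1)
print(mask)  # [[ True  True False False]]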
The Lambda layer, by default, does not propagate masks. In other words, the mask tensor computed by the Masking layer is thrown away by the Lambda layer, and thus the Masking layer has no effect on the output loss.
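You can check this directly (a toy check, not from the original post): a Lambda created without the mask argument returns None from compute_mask no matter what mask it receives.
import numpy as np
from keras.layers import Lambda

lam = Lambda(lambda t: t[:, 5:, :])
dummy_mask = np.ones((2, 10), dtype=bool)       # stand-in for the mask coming from Masking/LSTM
print(lam.compute_mask(None, mask=dummy_mask))  # None -> the incoming mask is discarded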
If you want the compute_mask method of a Lambda layer to propagate the previous mask, you have to provide the mask argument when the layer is created, as can be seen from the source code of the Lambda layer:
def __init__(self, function, output_shape=None,
             mask=None, arguments=None, **kwargs):
    # ...
    if mask is not None:
        self.supports_masking = True
    self.mask = mask
    # ...

def compute_mask(self, inputs, mask=None):
    if callable(self.mask):
        return self.mask(inputs, mask)
    return self.mask
Because the default value of mask is None, compute_mask returns None and the loss is not masked at all.
To fix the problem, since your Lambda layer itself does not introduce any additional masking, its compute_mask method should just return the mask from the previous layer (with appropriate slicing to match the output shape of the layer):
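# slice the incoming mask exactly like the data below, so mask and outputs stay aligned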
masking_func = lambda inputs, previous_mask: previous_mask[:, N:]
model = Sequential()
model.add(Masking(mask_value=0., input_shape=(timesteps, features)))
model.add(LSTM(128, return_sequences=True))
model.add(LSTM(64, return_sequences=True))
model.add(LSTM(1, return_sequences=True))
model.add(Lambda(lambda x: x[:, N:, :], mask=masking_func))
Now you should be able to see the correct loss value.
>> model.evaluate(x_test, y_test, verbose=0)
0.2660679519176483
>> out = model.predict(x_test)
>> print('wo mask', mean_absolute_error(y_test.ravel(), out.ravel()))
wo mask 0.26519736809498456
>> print('w mask', mean_absolute_error(y_test[~(x_test[:,N:] == 0).all(axis=2)].ravel(), out[~(x_test[:,N:] == 0).all(axis=2)].ravel()))
w mask 0.2660679670482195
Using NaN values for padding does not work because masking is done by multiplying the loss tensor with a binary mask (0 * nan is still nan, so the mean value would be nan).
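A quick numpy check of why that happens:
import numpy as np

# the loss at masked positions is zeroed by multiplication, but 0 * NaN is still NaN
print(0. * np.nan)                        # nan
print(np.mean([1.0, 2.0, 0. * np.nan]))   # nan -> the mean loss is poisoned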