Sensors (of the same type) scattered across my site report manually, at irregular intervals, to my backend. Between reports, each sensor aggregates events and reports them as a batch.
The following dataset is a collection of batch-collected sequence event data. For example, sensor 1 reported twice: 2 events in the first batch and 3 events in the second, while sensor 2 reported once with 3 events.
I would like to use this data as my training data X:
sensor_id | batch_id | timestamp | feature_1 | feature_n |
---|---|---|---|---|
1 | 1 | 2020-12-21T00:00:00+00:00 | 0.54 | 0.33 |
1 | 1 | 2020-12-21T01:00:00+00:00 | 0.23 | 0.14 |
1 | 2 | 2020-12-21T03:00:00+00:00 | 0.51 | 0.13 |
1 | 2 | 2020-12-21T04:00:00+00:00 | 0.23 | 0.24 |
1 | 2 | 2020-12-21T05:00:00+00:00 | 0.33 | 0.44 |
2 | 1 | 2020-12-21T00:00:00+00:00 | 0.54 | 0.33 |
2 | 1 | 2020-12-21T01:00:00+00:00 | 0.23 | 0.14 |
2 | 1 | 2020-12-21T03:00:00+00:00 | 0.51 | 0.13 |
My target, y, is a score calculated from all the events collected by a sensor:
i.e. score_sensor_1 = f([[batch1...],[batch2...]])
sensor_id | final_score |
---|---|
1 | 0.8 |
2 | 0.6 |
I would like to predict y each time a batch is collected, i.e. 2 predictions for a sensor with 2 reports.
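For concreteness, here is a minimal sketch of how such (X, y) pairs could be assembled with pandas, under one reading of the setup where each prediction sees all events up to the current batch (events_df and scores_df are assumed names for the two tables above):

import pandas as pd

#Hypothetical frames holding the two tables above.
events_df = pd.DataFrame({
    "sensor_id": [1, 1, 1, 1, 1, 2, 2, 2],
    "batch_id":  [1, 1, 2, 2, 2, 1, 1, 1],
    "feature_1": [0.54, 0.23, 0.51, 0.23, 0.33, 0.54, 0.23, 0.51],
    "feature_n": [0.33, 0.14, 0.13, 0.24, 0.44, 0.33, 0.14, 0.13],
})
scores_df = pd.DataFrame({"sensor_id": [1, 2], "final_score": [0.8, 0.6]})

X, y = [], []
for (sensor, batch), _ in events_df.groupby(["sensor_id", "batch_id"]):
    #All events the sensor has reported up to and including this batch,
    #since a prediction is made each time a batch arrives.
    seen = events_df[(events_df.sensor_id == sensor) & (events_df.batch_id <= batch)]
    X.append(seen[["feature_1", "feature_n"]].to_numpy())
    y.append(scores_df.loc[scores_df.sensor_id == sensor, "final_score"].iloc[0])
#len(X) == 3: two prediction points for sensor 1, one for sensor 2;
#each X[i] is a variable-length (events, features) array.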
LSTM model:
I've started with an LSTM model, since I'm trying to predict on a time-series of events.
My first thought was to select a fixed-size input and to zero-pad the input when the number of events collected is smaller than the input size, then mask the padded values:
model.add(Masking(mask_value=0., input_shape=(num_samples, num_features)))
For example:
sensor_id | batch_id | timestamp | feature_1 | feature_n |
---|---|---|---|---|
1 | 1 | 2020-12-21T00:00:00+00:00 | 0.54 | 0.33 |
1 | 1 | 2020-12-21T01:00:00+00:00 | 0.23 | 0.14 |
Would produce the following input if the selected length is 5:
[
[0.54, 0.33],
[0.23, 0.14],
[0,0],
[0,0],
[0,0]
]
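In full, that padded-and-masked setup would look something like this minimal sketch (max_len=5 and two features, matching the toy example; the layer sizes are placeholders):

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Masking, LSTM, Dense

max_len, num_features = 5, 2
model = Sequential([
    #Timesteps whose features are all 0.0 are skipped by downstream layers.
    Masking(mask_value=0., input_shape=(max_len, num_features)),
    LSTM(10),
    Dense(1),
])
model.compile(loss='mse', optimizer='adam')

#Two real events zero-padded up to max_len, as in the example above.
x = np.array([[[0.54, 0.33], [0.23, 0.14], [0, 0], [0, 0], [0, 0]]])
model.predict(x)  #shape (1, 1)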
However, the variance in the number of events per sensor report in my training data is large: one report could collect 1000 events while another collects 10. So if I select the average size (let's say 200), some inputs would carry a lot of padding, while others would be truncated and data would be lost.
I've heard about ragged tensors, but I'm not sure they fit my use case. How would one approach such a problem?
I don't have the specifics of your model, but the TF implementation of LSTM usually expects (batch, seq, features) as input.
Now let's assume this is one of your batch_ids:
import numpy as np

data = np.zeros((15, 5))
array([[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.]])
You could reshape it to (1, 15, 5) and feed it to the model, but any time your batch_id length varies, your sequence length varies too, and your model expects a fixed sequence length.
Instead you could reshape your data before training so that the batch_id length is passed as the batch size:
data = data[:,np.newaxis,:]
array([[[0., 0., 0., 0., 0.]],
[[0., 0., 0., 0., 0.]],
[[0., 0., 0., 0., 0.]],
[[0., 0., 0., 0., 0.]],
[[0., 0., 0., 0., 0.]],
[[0., 0., 0., 0., 0.]],
[[0., 0., 0., 0., 0.]],
[[0., 0., 0., 0., 0.]],
[[0., 0., 0., 0., 0.]],
[[0., 0., 0., 0., 0.]],
[[0., 0., 0., 0., 0.]],
[[0., 0., 0., 0., 0.]],
[[0., 0., 0., 0., 0.]],
[[0., 0., 0., 0., 0.]],
[[0., 0., 0., 0., 0.]]])
Same data, with shape (15, 1, 5), but your model would now be looking at a fixed sequence length of 1, and the number of samples would vary. Make sure to reshape your labels as well.
To my knowledge, since RNN and LSTM cells are applied at each time step and the state is only reset between batches, this should not impact the model's behavior.
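As a rough end-to-end sketch of this reshape trick (the layer sizes and the zero data are placeholders):

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

model = Sequential([
    LSTM(10, input_shape=(1, 5)),  #fixed sequence length of 1
    Dense(1),
])
model.compile(loss='mse', optimizer='adam')

data = np.zeros((15, 5))[:, np.newaxis, :]  #(15, 1, 5): batch_id length as batch size
labels = np.zeros((15, 1))                  #one label per sample, reshaped to match
model.fit(data, labels, epochs=1)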
Working with variable-sized input sequences is quite simple. While there is a restriction of having same-sized sequences within each batch, there is NO restriction on having variable-sized sequences between batches. Using this to your advantage, you can simply set the input shape for the LSTM to (None, features) and use a batch_size of 1.
Let's create a generator that yields variable-length sequences of 2 features together with a random float score (standing in for the score you compute from these sequences), similar to your sensor data.
import numpy as np

#Infinitely creates batches of dummy data
def generator():
    while True:
        length = np.random.randint(2, 10)           #Variable-length sequences
        x_train = np.random.random((1, length, 2))  #batch, seq, features
        y_train = np.random.random((1, 1))          #batch, score
        yield x_train, y_train
next(generator())
#x.shape = (1,4,2), y.shape = (1,1)
(array([[[0.63841991, 0.91141833],
[0.73131801, 0.92771373],
[0.61298585, 0.6455549 ],
[0.25893925, 0.40202978]]]),
array([[0.05934613]]))
Above is an example of a sequence of length 4 created by the generator, while the next is one of length 9.
next(generator())
#x.shape = (1,9,2), y.shape = (1,1)
(array([[[0.76006158, 0.27457503],
[0.57739596, 0.75416962],
[0.03029365, 0.29339812],
[0.93866829, 0.79137367],
[0.52739961, 0.11475738],
[0.85832651, 0.19247399],
[0.37098216, 0.48703114],
[0.95846681, 0.15507787],
[0.86945015, 0.70949593]]]),
array([[0.02560889]]))
Now, let's create an LSTM based neural net that can work with these variable-sized sequences for each batch.
from tensorflow.keras import layers, Model, utils
inp = layers.Input((None, 2))
x = layers.LSTM(10, return_sequences=True)(inp)
x = layers.LSTM(10)(x)
out = layers.Dense(1)(x)
model = Model(inp, out)
utils.plot_model(model, show_layer_names=False, show_shapes=True)
Training this with a batch size of 1 (each generator yield is already a batch of one sequence) -
model.compile(loss='binary_crossentropy', optimizer='adam')
model.fit(generator(), steps_per_epoch=100, epochs=10)
#steps_per_epoch stops the generator from producing infinite batches per epoch.
#batch_size is not passed: the generator already yields batches (of size 1).
Epoch 1/10
100/100 [==============================] - 1s 5ms/step - loss: 1.5145
Epoch 2/10
100/100 [==============================] - 0s 5ms/step - loss: 0.7435
Epoch 3/10
100/100 [==============================] - 0s 4ms/step - loss: 0.7885
Epoch 4/10
100/100 [==============================] - 0s 4ms/step - loss: 0.7384
Epoch 5/10
100/100 [==============================] - 0s 4ms/step - loss: 0.7139
Epoch 6/10
100/100 [==============================] - 0s 5ms/step - loss: 0.7462
Epoch 7/10
100/100 [==============================] - 0s 4ms/step - loss: 0.7173
Epoch 8/10
100/100 [==============================] - 0s 4ms/step - loss: 0.7116
Epoch 9/10
100/100 [==============================] - 0s 4ms/step - loss: 0.6875
Epoch 10/10
100/100 [==============================] - 0s 4ms/step - loss: 0.7153
This is how you can work with variable-sized sequences as inputs. Padding/masking is only necessary for sequences that are part of the same batch.
Now, you could create a generator for your input data that generates one sequence of events at a time as input to the model, in which case you do not need to specify the batch_size explicitly, since you are generating one sequence at a time already. As the Keras docs put it: "Do not specify the batch_size if your data is in the form of datasets, generators, or keras.utils.Sequence instances (since they generate batches)."
Or you could use the ragged tensors you mentioned and provide a batch_size of 1 for each sequence. Personally, I prefer working with generators for training data, as they give you a lot more flexibility in pre-processing as well.
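For completeness, here is a hedged sketch of that ragged-tensor route (support for ragged inputs in Keras RNN layers depends on your TF 2.x version):

import tensorflow as tf
from tensorflow.keras import layers, Model

#Two sequences of different lengths in a single ragged batch.
x = tf.ragged.constant([
    [[0.54, 0.33], [0.23, 0.14]],
    [[0.54, 0.33], [0.23, 0.14], [0.51, 0.13]],
], ragged_rank=1)
y = tf.constant([[0.8], [0.6]])

inp = layers.Input(shape=(None, 2), ragged=True)  #ragged time dimension
out = layers.Dense(1)(layers.LSTM(10)(inp))
model = Model(inp, out)
model.compile(loss='mse', optimizer='adam')
model.fit(x, y, epochs=1)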
Interestingly, you could optimize this code further by bundling same-length sequences together into batches and then passing a variable batch size, as sketched below. This would help if you have tons of data and can't afford to run a batch_size of 1 for each gradient update!
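A sketch of that bucketing idea (sequences and scores are assumed to be a list of variable-length (events, features) arrays and their matching labels):

from collections import defaultdict
import numpy as np

def bucketed_generator(sequences, scores):
    #Group sequence indices by length so each yielded batch is rectangular.
    buckets = defaultdict(list)
    for i, seq in enumerate(sequences):
        buckets[len(seq)].append(i)
    while True:
        for length, idxs in buckets.items():
            x = np.stack([sequences[i] for i in idxs])       #(batch, length, features)
            y = np.array([scores[i] for i in idxs]).reshape(-1, 1)
            yield x, y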
Another word of caution! If your sequences are extremely long, then I would recommend using Truncated Backpropagation Through Time (TBPTT).
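A hedged sketch of what TBPTT can look like in Keras, approximated with a stateful LSTM fed fixed-size chunks (the chunk size, shapes, and the repeated target are placeholders):

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

chunk, features = 100, 2
model = Sequential([
    #stateful=True carries hidden state across consecutive chunks.
    LSTM(10, stateful=True, batch_input_shape=(1, chunk, features)),
    Dense(1),
])
model.compile(loss='mse', optimizer='adam')

long_seq = np.random.random((1, 1000, features))
target = np.random.random((1, 1))
for start in range(0, 1000, chunk):
    #Gradients only flow within each chunk; state flows across chunks.
    model.train_on_batch(long_seq[:, start:start + chunk, :], target)
model.reset_states()  #reset before the next independent sequence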
Hope this solves what you are looking for.