I am new to machine learning in Keras and am planning an experiment that predicts the sequence of the first ten items bought in a video game match, using a recurrent neural network with LSTM layer(s).
Suppose an exemplary table, pre-sorted by gameId, side and timestamp, is given:
gameId side timestamp itemId
3030038208 100 4260 1055
3030038208 100 4648 2010
3030038208 100 5036 3340
3030038208 100 291561 1001
3030038208 100 295807 1083
3030038208 100 296457 2010
3030038208 200 3257 1055
3030038208 200 3516 2003
3030038208 200 3775 3340
3030038208 200 321461 1038
3030038208 200 321818 2003
3030038208 200 321979 2003
3030038208 200 491099 3006
3030038208 200 492238 1042
3030038208 200 743864 3086
3030038208 200 744773 1043
....
I would now like to reshape the dataframe into two 3D numpy arrays (x and y), with one axis running along the purchase sequence (itemId), such that every 2D slice of the result is the table for a single gameId, side pair.
Before training the neural network I would also need to insert padding, since every sequence should have the length of 10 mentioned above. In this example a padding value of 0 would seem all right; however, in the real scenario I am working with a sparse matrix that includes a lot of 0 values.
Now here are some questions:
1) Are there any built-in functions in numpy, pandas or even Keras to efficiently achieve my stated goals? I can't think of anything that wouldn't take me ages to turn into a sensible preprocessing function.
2) Are there any other considerations that need to be taken care of, especially regarding padding? Would filling in "-999" not make more sense when dealing with sparse matrices?
3) Suppose the model looks something like this:
model = Sequential()
model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2, input_dim=1))
model.add(Dense(y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])
monitor = EarlyStopping(monitor='val_loss', min_delta=1e-3, patience=5, verbose=0, mode='auto')
checkpointer = ModelCheckpoint(filepath=filepath + "best_weights.hdf5", verbose=0, save_best_only=True)
With:
history = model.fit(x_train, y_train, epochs=2, validation_split=0.33, callbacks=[monitor, checkpointer], verbose=0).history
How would I be able to employ a masking layer correctly that takes care of the padding?
Thanks in advance for any second spent on that thread!
Edit: on request, here are the numpy arrays (I think) I would like to get in order to predict the itemId based on the timestamp with a neural network with LSTM layers in Keras, before padding:
y = [
[1055, 2010, 3340, 1001, 1083, 2010],
[1055, 2003, 3340, 1038, 2003, 2003, 3006, 1042, 3086, 1043],
...
]
x = [
[[4260], [4648], [5036], [291561], [295807], [296457]],
[[3257], [3516], [3775], [321461], [321818], [321979], [491099], [492238], [743864], [744773] ],
...
]
and after padding:
y = [
[1055, 2010, 3340, 1001, 1083, 2010, 0, 0, 0, 0],
[1055, 2003, 3340, 1038, 2003, 2003, 3006, 1042, 3086, 1043],
...
]
x = [
[[4260], [4648], [5036], [291561], [295807], [296457], [0], [0], [0], [0]],
[[3257], [3516], [3775], [321461], [321818], [321979], [491099], [492238], [743864], [744773] ],
...
]
However, there will be more features than just timestamp in the real example.
You can achieve this in a few steps by extracting data from a pandas groupby object. The first two steps create the groupby object so that we can operate on it later in the code. From the groupby object, we find the largest group so that we can pad with zeros accordingly:
import numpy as np  # needed for np.pad further down
gb = df.groupby(['gameId', 'side'])  # Create groupby object
mx = gb.size().max()  # Size of the largest group
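To make the grouping step concrete, here is a minimal sketch that rebuilds the sample table from the question and inspects the group sizes. The variable names `rows` and `sizes` are only illustrative.

```python
import pandas as pd

# Sample rows from the question: (gameId, side, timestamp, itemId)
rows = [
    (3030038208, 100, 4260, 1055), (3030038208, 100, 4648, 2010),
    (3030038208, 100, 5036, 3340), (3030038208, 100, 291561, 1001),
    (3030038208, 100, 295807, 1083), (3030038208, 100, 296457, 2010),
    (3030038208, 200, 3257, 1055), (3030038208, 200, 3516, 2003),
    (3030038208, 200, 3775, 3340), (3030038208, 200, 321461, 1038),
    (3030038208, 200, 321818, 2003), (3030038208, 200, 321979, 2003),
    (3030038208, 200, 491099, 3006), (3030038208, 200, 492238, 1042),
    (3030038208, 200, 743864, 3086), (3030038208, 200, 744773, 1043),
]
df = pd.DataFrame(rows, columns=["gameId", "side", "timestamp", "itemId"])

gb = df.groupby(["gameId", "side"])
sizes = gb.size()      # one entry per (gameId, side) pair
mx = int(sizes.max())  # length of the longest purchase sequence

print(sizes.tolist())  # [6, 10]
print(mx)              # 10
```

Here the longest group happens to have 10 rows; if the real data contains longer sequences and only the first ten purchases matter, the groups would need to be truncated first.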
The steps for creating x and y are very similar. We can use a list comprehension to loop over each group, convert the relevant column of each frame into a numpy array and pad it with zeros using np.pad(). Then reshape the result to be 3D:
x = np.array([np.pad(frame['timestamp'].values,
                     pad_width=(0, mx - len(frame)),
                     mode='constant',
                     constant_values=0)
              for _, frame in gb]).reshape(-1, mx, 1)
y = np.array([np.pad(frame['itemId'].values,
                     pad_width=(0, mx - len(frame)),
                     mode='constant',
                     constant_values=0)
              for _, frame in gb]).reshape(-1, mx, 1)
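Since the question mentions that the real data has more features than just the timestamp, the same idea can be extended to several columns. The following is a sketch on toy data, assuming a hypothetical second feature column called `gold` that does not appear in the question. The padding is applied along the time axis only, so no final reshape is needed.

```python
import numpy as np
import pandas as pd

# Toy data; "gold" is a hypothetical second feature, not from the question.
df = pd.DataFrame({
    "gameId":    [1, 1, 1, 2, 2],
    "side":      [100, 100, 100, 100, 100],
    "timestamp": [4260, 4648, 5036, 3257, 3516],
    "gold":      [500, 150, 75, 500, 40],
    "itemId":    [1055, 2010, 3340, 1055, 2003],
})
features = ["timestamp", "gold"]

gb = df.groupby(["gameId", "side"])
mx = int(gb.size().max())

# Pad rows (time steps) at the end; leave the feature axis untouched.
x = np.array([
    np.pad(frame[features].values,
           pad_width=((0, mx - len(frame)), (0, 0)),
           mode="constant",
           constant_values=0)
    for _, frame in gb
])

print(x.shape)  # (2, 3, 2) -> (samples, timesteps, features)
```

The second group has only two rows, so its last time step is filled with zeros for both features.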
In this example, the setup is for a many-to-many LSTM. In the comments I had pointed out that your current setup would not support a 3D output value, because your LSTM layer does not have the argument return_sequences=True.
It's unclear which structure you are looking for in this problem. I like to consult the following image when deciding which LSTM network to use. The code above will support a many-to-many network, assuming you add return_sequences=True to your LSTM layer. If you want many-to-one instead, drop .reshape(-1,mx,1) from y, and you have a network with mx outputs.
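The difference between the two target layouts is only in the shape of y. A small sketch using the padded item ids from the question (the names `y_many_to_many` and `y_many_to_one` are illustrative):

```python
import numpy as np

# Padded item ids from the question's example (0 is the padding value).
y_padded = np.array([
    [1055, 2010, 3340, 1001, 1083, 2010, 0, 0, 0, 0],
    [1055, 2003, 3340, 1038, 2003, 2003, 3006, 1042, 3086, 1043],
])
mx = y_padded.shape[1]

# Many-to-many: one target per time step -> (samples, timesteps, 1)
y_many_to_many = y_padded.reshape(-1, mx, 1)

# Many-to-one: a single target vector per sequence -> (samples, mx)
y_many_to_one = y_padded

print(y_many_to_many.shape)  # (2, 10, 1)
print(y_many_to_one.shape)   # (2, 10)
```

Note that with a softmax output and categorical cross-entropy, raw item ids would most likely still need to be encoded as class indices or one-hot vectors; that step is not shown here.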
For either setup, you need to replace input_dim=1 with the input_shape argument in your model. This argument must specify the 2nd and 3rd dimensions of x, i.e. (timesteps, features):
# Use input_shape here: x.shape[1:] is (timesteps, features)
model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2, input_shape=x.shape[1:]))
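Regarding the masking part of the question: Keras provides a `Masking` layer with a `mask_value` argument, which skips every time step whose features all equal that value. The sketch below reproduces that rule in plain numpy so you can check which time steps would be skipped for a given padding value. The helper `timestep_mask` and the sentinel `PAD = -999.0` are assumptions for illustration, not part of Keras.

```python
import numpy as np

def timestep_mask(x, mask_value):
    """True where a time step holds real data, False where every
    feature of that time step equals mask_value."""
    return np.any(x != mask_value, axis=-1)

PAD = -999.0  # assumed sentinel that never occurs in the real features

# Two sequences, 4 time steps, 2 features; real data contains zeros.
x = np.array([
    [[0.0, 0.0], [1.0, 0.0], [PAD, PAD], [PAD, PAD]],
    [[2.0, 3.0], [0.0, 0.0], [0.0, 1.0], [4.0, 0.0]],
])

mask = timestep_mask(x, PAD)
lengths = mask.sum(axis=1)  # number of real time steps per sequence

print(mask.tolist())     # [[True, True, False, False], [True, True, True, True]]
print(lengths.tolist())  # [2, 4]

# With 0 as the mask value, genuine all-zero time steps would be dropped too.
print(timestep_mask(x, 0.0).tolist())
# [[False, True, True, True], [True, False, True, True]]
```

This suggests that with sparse data a sentinel such as -999 is the safer padding value, as long as the same value is passed to `Masking(mask_value=...)` placed before the LSTM layer and it never appears as a full row of real data.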