
LSTM preprocessing: Build 3d arrays from pandas data frame based on ID

I am new to machine learning in keras and I am planning to conduct a machine learning experiment that predicts a sequence of the first ten items bought in a video game match based on a recurrent neural network with lstm layer(s).

Suppose an exemplary table, pre-sorted by gameId, side and timestamp, is given:

       gameId   side   timestamp  itemId 
   3030038208    100        4260    1055 
   3030038208    100        4648    2010 
   3030038208    100        5036    3340 
   3030038208    100      291561    1001 
   3030038208    100      295807    1083 
   3030038208    100      296457    2010 
   3030038208    200        3257    1055 
   3030038208    200        3516    2003 
   3030038208    200        3775    3340 
   3030038208    200      321461    1038 
   3030038208    200      321818    2003 
   3030038208    200      321979    2003 
   3030038208    200      491099    3006 
   3030038208    200      492238    1042 
   3030038208    200      743864    3086 
   3030038208    200      744773    1043
         ....

I would now like to reshape the dataframe into two 3d numpy arrays (x and y) in which the third dimension describes the length of the purchase sequence (itemId), so that essentially every 2d numpy array in the result constitutes the table for one gameId, side pair.

Before training the neural network I would also need to insert padding, since every sequence should have length 10, as mentioned above. In this example a padding value of 0 would seem alright; however, in the real scenario I am working with a sparse matrix that includes a lot of 0 values.

Now here are some questions:

1) Are there any built-in functions in numpy, pandas or even keras to efficiently achieve my stated goals? Writing a sensible preprocessing function from scratch would take me ages.

2) Are there any other considerations that need to be taken care of, especially regarding padding? Would filling in "-999" not make more sense when dealing with sparse matrices?

3) Suppose the model looks something like this:

from keras.models import Sequential
from keras.layers import LSTM, Dense
from keras.callbacks import EarlyStopping, ModelCheckpoint

model = Sequential()
model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2, input_dim=1))
model.add(Dense(y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])
monitor = EarlyStopping(monitor='val_loss', min_delta=1e-3, patience=5, verbose=0, mode='auto')
checkpointer = ModelCheckpoint(filepath=filepath + "best_weights.hdf5", verbose=0, save_best_only=True)

With:

history = model.fit(x_train, y_train, epochs=2, validation_split=0.33, callbacks=[monitor, checkpointer], verbose=0).history

How would I be able to employ a masking layer correctly that takes care of the padding?

Thanks in advance for any second spent on that thread!

edit: on request, here are the numpy arrays that (I think) I would like to obtain, before padding, in order to predict the itemId from the timestamp with an LSTM network in keras:

y = [
[1055, 2010, 3340, 1001, 1083, 2010],
[1055, 2003, 3340, 1038, 2003, 2003, 3006, 1042, 3086, 1043],
...
]

x = [
[[4260], [4648], [5036], [291561], [295807], [296457]],
[[3257], [3516], [3775], [321461], [321818], [321979], [491099], [492238], [743864], [744773] ],
...
]

and after padding:

y = [
[1055, 2010, 3340, 1001, 1083, 2010, 0, 0, 0, 0],
[1055, 2003, 3340, 1038, 2003, 2003, 3006, 1042, 3086, 1043],
...
]

x = [
[[4260], [4648], [5036], [291561], [295807], [296457], [0], [0], [0], [0]],
[[3257], [3516], [3775], [321461], [321818], [321979], [491099], [492238], [743864], [744773] ],
...
]

However, there will be more features than just timestamp in the real example.

DwayneHart asked Apr 12 '18 18:04


1 Answer

You can achieve this in a few steps by extracting data from a pandas groupby object. First, create the groupby object so that we can operate on it later in the code. Then find the size of the largest group, so that we can pad with zeros accordingly:

gb = df.groupby(['gameId','side']) # Create Groupby object
mx = gb['side'].size().max() # Find the largest group

The steps for creating x and y are very similar. We can use a list comprehension to loop over each group, convert the relevant column into a numpy array and pad it with zeros using np.pad(), then reshape the stacked result to be 3d:

x = np.array([np.pad(frame['timestamp'].values,
                     pad_width=(0,mx-len(frame)),
                     mode='constant',
                     constant_values=0) 
                     for _,frame in gb]).reshape(-1,mx,1)

y = np.array([np.pad(frame['itemId'].values,
                     pad_width=(0,mx-len(frame)),
                     mode='constant',
                     constant_values=0) 
                     for _,frame in gb]).reshape(-1,mx,1)
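
To see the two steps above end to end, here is a minimal, self-contained sketch that rebuilds the example table from the question and applies the same groupby and np.pad() approach. The helper name pad_groups and its pad_value parameter are made up for illustration and are not part of pandas or numpy. It assumes, as above, that no group is longer than mx.

```python
import numpy as np
import pandas as pd

# The example table from the question (one gameId, two sides).
df = pd.DataFrame({
    "gameId": [3030038208] * 16,
    "side": [100] * 6 + [200] * 10,
    "timestamp": [4260, 4648, 5036, 291561, 295807, 296457,
                  3257, 3516, 3775, 321461, 321818, 321979,
                  491099, 492238, 743864, 744773],
    "itemId": [1055, 2010, 3340, 1001, 1083, 2010,
               1055, 2003, 3340, 1038, 2003, 2003,
               3006, 1042, 3086, 1043],
})


def pad_groups(grouped, column, length, pad_value=0):
    """Stack one column per group into an array of shape (n_groups, length, 1).

    Hypothetical helper: groups shorter than `length` are right-padded
    with `pad_value`.
    """
    rows = [
        np.pad(frame[column].to_numpy(),
               pad_width=(0, length - len(frame)),
               mode="constant",
               constant_values=pad_value)
        for _, frame in grouped
    ]
    return np.array(rows).reshape(-1, length, 1)


gb = df.groupby(["gameId", "side"])  # one group per (gameId, side) pair
mx = int(gb.size().max())            # length of the longest sequence

x = pad_groups(gb, "timestamp", mx)
y = pad_groups(gb, "itemId", mx)
print(x.shape, y.shape)  # (2, 10, 1) (2, 10, 1)
```

With more than one feature, the same idea should carry over by padding a 2d block per group instead of a single column, so that the last axis holds the features.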

In this example, the setup is for a many-to-many LSTM. In the comments I had pointed out that your current setup would not support a 3d output value, because your LSTM layer does not have the argument return_sequences=True.

It's unclear which structure you are looking for in this problem. I like to consult the following image when deciding which LSTM network to use. The code above will support a many-to-many network, assuming you add return_sequences=True to your LSTM layer. If you want many-to-one instead, drop .reshape(-1,mx,1) from y, and you have a network with mx outputs.

[Image not preserved: the diagram of LSTM network structures referred to above]
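
As a rough illustration of the difference in target shapes between the two setups, the sketch below uses the padded item IDs from the question. The variable names are only for illustration, and it shows array shapes only, not a full model.

```python
import numpy as np

# Padded item IDs for the two example groups from the question, shape (2, 10).
y_flat = np.array([
    [1055, 2010, 3340, 1001, 1083, 2010, 0, 0, 0, 0],
    [1055, 2003, 3340, 1038, 2003, 2003, 3006, 1042, 3086, 1043],
])

# Many-to-many: one target per timestep, so keep a trailing axis of size 1.
y_many_to_many = y_flat.reshape(-1, y_flat.shape[1], 1)

# Many-to-one as described above: skip the reshape, leaving one vector
# of mx values per sequence.
y_many_to_one = y_flat

print(y_many_to_many.shape)  # (2, 10, 1)
print(y_many_to_one.shape)   # (2, 10)
```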


For either setup, you need to set the input_shape argument of your model's first layer (in place of input_dim). It must specify the 2nd and 3rd dimensions of x, i.e. (timesteps, features):

                                                        # v Use input_shape here
model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2, input_shape=x.shape[1:]))
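
On the padding part of the question: the numpy sketch below shows which timesteps would count as padding when a dedicated pad value such as -999 is used instead of 0. It follows the rule documented for Keras' Masking layer, where a timestep is skipped only if every feature equals mask_value. The function name timestep_mask and the toy data are assumptions for illustration and are not part of Keras.

```python
import numpy as np


def timestep_mask(x, pad_value=0):
    """Return a (n_sequences, n_timesteps) boolean array, True = real data.

    A timestep is treated as padding only if all of its features
    equal `pad_value`.
    """
    return np.any(x != pad_value, axis=-1)


# Two sequences, three timesteps, two features. -999 marks padding, so a
# genuine all-zero row (common with sparse features) is still kept.
x = np.array([
    [[1.0, 0.0], [0.0, 0.0], [-999.0, -999.0]],
    [[0.0, 2.0], [-999.0, -999.0], [-999.0, -999.0]],
])

mask = timestep_mask(x, pad_value=-999.0)
lengths = mask.sum(axis=1)  # number of real timesteps per sequence

print(mask)
print(lengths)
```

In a Keras model, the corresponding step would presumably be a Masking(mask_value=-999.0) layer placed before the LSTM layer, with the same pad value used when building x.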
DJK answered Nov 09 '22 05:11
