
Machine learning: how to use the past 20 rows as an input for X for each Y value

I have a very simple machine learning code here:

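import pandas

# load dataset
dataframe = pandas.read_csv("USDJPY,5.csv", header=None)
dataset = dataframe.values
X = dataset[:, 0:59]   # first 59 columns: features
Y = dataset[:, 59]     # 60th column: 0/1 label
# fit Dense Keras model (model, x and y_test are defined elsewhere)
model.fit(X, Y, validation_data=(x,y_test), epochs=150, batch_size=10)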

My X values are 59 features with the 60th column being my Y value, a simple 1 or 0 classification label.

Considering that I am using financial data, I would like to look back over the past 20 X values in order to predict the Y value.

So how could I make my algorithm use the past 20 rows as an input for X for each Y value?

I'm relatively new to machine learning and have spent a lot of time looking online for a solution to my problem, yet I could not find anything as simple as my case.

Any ideas?

xion asked Aug 18 '17 20:08


2 Answers

This is typically done with recurrent neural networks (RNNs), which retain some memory of the previous input when the next input is received. That's a very brief explanation of what goes on, but there are plenty of sources on the internet that explain how they work in more depth.

Let's break this down with a simple example. Say you have 5 samples and 5 features of data, and you want to stagger the data by 2 rows instead of 20. Here is your data (assuming 1 stock, with the oldest price value first). We can think of each row as a day of the week:

import numpy as np

ar = np.random.randint(10, 100, (5, 5))  # example values shown below

[[43, 79, 67, 20, 13],    #<---Monday---
 [80, 86, 78, 76, 71],    #<---Tuesday---
 [35, 23, 62, 31, 59],    #<---Wednesday---
 [67, 53, 92, 80, 15],    #<---Thursday---
 [60, 20, 10, 45, 47]]    #<---Friday---

To use an LSTM in Keras, your data needs to be 3-D rather than the 2-D structure it has now, and the notation for each dimension is (samples, timesteps, features). Currently you only have (samples, features), so you would need to augment the data.

a2 = np.concatenate([ar[x:x+2,:] for x in range(ar.shape[0]-1)])
a2 = a2.reshape(4,2,5)

[[[43, 79, 67, 20, 13],    #See Monday First
  [80, 86, 78, 76, 71]],   #See Tuesday second ---> Predict Value originally set for Tuesday
 [[80, 86, 78, 76, 71],    #See Tuesday First
  [35, 23, 62, 31, 59]],   #See Wednesday Second ---> Predict Value originally set for Wednesday
 [[35, 23, 62, 31, 59],    #See Wednesday Value First
  [67, 53, 92, 80, 15]],   #See Thursday Values Second ---> Predict value originally set for Thursday
 [[67, 53, 92, 80, 15],    #And so on
  [60, 20, 10, 45, 47]]])

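Notice how the data is staggered and 3-dimensional. Now just make an LSTM network. Y remains 2-D since this is a many-to-one structure; however, you need to clip the first value so that the labels line up with the windows (here Y[1:], because the first window ends on Tuesday).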

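from keras.models import Sequential
from keras.layers import LSTM, Dense
hidden_dims = 32  # number of LSTM units; an arbitrary value to tune
model = Sequential()
model.add(LSTM(hidden_dims, input_shape=(a2.shape[1], a2.shape[2])))
model.add(Dense(1))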
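
Scaled up to the question's setting (59 features, a 20-row lookback), the staggering step might look like the sketch below. The helper name `make_windows` and the random stand-in data are assumptions for illustration. As in the small example, each window ends on the row whose label it predicts, so the first `lookback - 1` labels are dropped.

```python
import numpy as np

def make_windows(X, Y, lookback):
    """Stack overlapping windows of `lookback` consecutive rows.

    Returns an array of shape (samples, lookback, features) and the
    labels aligned with the last row of each window.
    """
    n_windows = X.shape[0] - lookback + 1
    X3 = np.stack([X[i:i + lookback] for i in range(n_windows)])
    return X3, Y[lookback - 1:]

# Stand-in data with the question's layout: 59 features, 0/1 labels.
rng = np.random.default_rng(0)
X = rng.random((200, 59))
Y = rng.integers(0, 2, size=200)

X3, Y3 = make_windows(X, Y, lookback=20)
print(X3.shape, Y3.shape)  # (181, 20, 59) (181,)
```

The resulting `X3` has the (samples, timesteps, features) layout that the LSTM layer expects, with `input_shape=(20, 59)`.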

This is just a brief example to get you moving. There are many different setups that will work (including ones that do not use an RNN); you need to find the right one for your data.

DJK answered Oct 13 '22 21:10


This seems to be a time series type of task.
I would start by looking at recurrent neural networks in Keras.

If you want to keep using the model you have (which I would not recommend), you may want to transform your data set into some kind of weighted average of the last 20 observations (rows).
This way, each observation in the new data set is a function of the previous 20, so that information is available for classification.

You can use something like this for each column if you want the running sum:

import numpy as np

def running_sum(x, N):
    cumsum = np.cumsum(np.insert(x, 0, 0)) 
    return (cumsum[N:] - cumsum[:-N]) 

x=np.random.rand(200)

print(running_sum(x,20))
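
For a 2-D feature matrix, the same cumulative-sum trick can be applied down every column in one call. The sketch below is one assumed way to extend `running_sum`; the name `rolling_mean_2d` is hypothetical, and it computes an unweighted mean rather than a weighted one.

```python
import numpy as np

def rolling_mean_2d(X, N):
    """Mean of each column over every block of N consecutive rows."""
    # Prepend a row of zeros so that differences of the cumulative sum
    # give the sum of rows i .. i+N-1 for each column.
    zeros = np.zeros((1, X.shape[1]))
    cumsum = np.cumsum(np.vstack([zeros, X]), axis=0)
    return (cumsum[N:] - cumsum[:-N]) / N

X = np.arange(12, dtype=float).reshape(6, 2)  # 6 rows, 2 features
means = rolling_mean_2d(X, 3)
print(means.shape)  # (4, 2)
```

The output has `N - 1` fewer rows than the input, so the labels need to be trimmed by the same amount before training.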

Alternatively, you could pivot your current data set so that each row holds the actual numbers: add 19 × (number of X dimensions) columns and populate them with the previous observations' data. Whether this is possible or practical depends on the shape of your data set.
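
One way to read this pivot: each new row is the concatenation of 20 consecutive original rows, so with 59 features a plain dense model would see 20 × 59 = 1180 inputs. A minimal sketch on a tiny array, assuming NumPy input and a hypothetical helper name `flatten_lags`:

```python
import numpy as np

def flatten_lags(X, lookback):
    """Concatenate each block of `lookback` consecutive rows into one row."""
    n_rows = X.shape[0] - lookback + 1
    return np.stack([X[i:i + lookback].ravel() for i in range(n_rows)])

X = np.arange(10).reshape(5, 2)   # 5 rows, 2 features
wide = flatten_lags(X, lookback=3)
print(wide.shape)  # (3, 6)
print(wide[0])     # [0 1 2 3 4 5]
```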

This is a simple, not too thorough, way to make sure each observation has the data that you think will make a good prediction. You need to be aware of these things:

  1. The modelling method must be 'ok' with observations that are not fully independent of one another.
  2. When you make the prediction for X[i], you must have all the information from X[i-20] to X[i-1].
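
Point 2 can be enforced when the windows are built. The sketch below, which assumes NumPy 1.20 or later for `sliding_window_view`, pairs each label `Y[i]` with rows `X[i-lookback]` to `X[i-1]` only, so the row being predicted is never part of its own input. The helper name `past_only_windows` is hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def past_only_windows(X, Y, lookback):
    """Pair Y[i] with rows X[i-lookback] .. X[i-1] (current row excluded)."""
    # sliding_window_view puts the window axis last:
    # shape (windows, features, lookback).
    windows = sliding_window_view(X, lookback, axis=0)
    # Reorder to (windows, lookback, features) and drop the final window,
    # which has no later label to predict.
    windows = windows.transpose(0, 2, 1)[:-1]
    return windows, Y[lookback:]

X = np.arange(12).reshape(6, 2)
Y = np.array([0, 1, 0, 1, 1, 0])
Xw, Yw = past_only_windows(X, Y, lookback=2)
print(Xw.shape, Yw.shape)  # (4, 2, 2) (4,)
```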

I'm sure there are other considerations that make this less than optimal, so I suggest using a dedicated RNN.

I am aware that DJK already pointed out that this is an RNN problem; I'm posting this after that answer was accepted, per the OP's request.

AChervony answered Oct 13 '22 22:10