Scikit-learn utilizes a very convenient approach based on `fit` and `predict` methods. I have time-series data in the format suited for `fit` and `predict`.
For example, I have the following `Xs`:
[[1.0, 2.3, 4.5], [6.7, 2.7, 1.2], ..., [3.2, 4.7, 1.1]]
and the corresponding `ys`:
[[1.0], [2.3], ..., [7.7]]
These data have the following meaning. The values stored in `ys` form a time series. The values in `Xs` are the corresponding time-dependent "factors" that are known to have some influence on the values in `ys` (for example: temperature, humidity and atmospheric pressure).
Now, of course, I can use `fit(Xs, ys)`. But then I get a model in which future values in `ys` depend only on the factors and do not depend on the previous `ys` values (at least not directly), and this is a limitation of the model. I would like to have a model in which `y_n` also depends on `y_{n-1}`, `y_{n-2}` and so on. For example, I might want to use an exponential moving average as a model. What is the most elegant way to do it in scikit-learn?
ADDED
As has been mentioned in the comments, I can extend `Xs` by adding `ys`. But this approach has some limitations. For example, if I add the last 5 values of `y` as 5 new columns to `X`, the information about the time ordering of the `ys` is lost: nothing in `X` indicates that the value in the 5th column follows the value in the 4th column, and so on. As a model, I might want to fit a line through the last five `ys` and use that linear function to make a prediction. But with 5 values in 5 separate columns this is not so trivial.
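One way to keep the ordering information is to build the lag columns explicitly and then summarize them into a single derived feature, such as the slope of a straight line fitted through the last five values. The following is only a minimal sketch, assuming pandas and NumPy are available; the helper name `make_lag_features`, the column names and the window of 5 are illustrative choices, not part of scikit-learn.

```python
import numpy as np
import pandas as pd

def make_lag_features(y, n_lags=5):
    """Hypothetical helper: lagged copies of y plus the slope of a
    least-squares line through those lags (oldest to newest)."""
    s = pd.Series(np.ravel(y), dtype=float)
    # Column y_lag_k holds the value observed k steps before the current row.
    lags = pd.concat(
        {f"y_lag_{k}": s.shift(k) for k in range(1, n_lags + 1)}, axis=1
    )
    # Reorder so that time increases from left to right: lag 5, lag 4, ..., lag 1.
    ordered = lags[[f"y_lag_{k}" for k in range(n_lags, 0, -1)]].to_numpy()
    # Closed-form least-squares slope: sum((t - t_mean) * (y - y_mean)) / sum((t - t_mean)^2)
    t_centered = np.arange(n_lags) - (n_lags - 1) / 2.0
    y_centered = ordered - ordered.mean(axis=1, keepdims=True)
    lags["y_slope"] = y_centered @ t_centered / (t_centered @ t_centered)
    # The first n_lags rows have incomplete history and are dropped.
    return lags.dropna()

# Made-up series that grows by 2 per step, so every slope should be 2.
features = make_lag_features(np.arange(10) * 2.0)
print(features)
```

The resulting columns can be joined to the factor columns by index before calling `fit`, so the regressor sees the trend of the recent values as one number instead of five unrelated columns.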
ADDED 2
To make my problem even clearer, I would like to give one concrete example. I would like to have a "linear" model in which y_n = c + k1*x1 + k2*x2 + k3*x3 + k4*EMOV_n, where EMOV_n is just an exponential moving average. How can I implement this simple model in scikit-learn?
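A model of this shape can be sketched by computing the moving average outside scikit-learn and passing it in as a fourth column. The sketch below assumes that EMOV_n is the exponential moving average of the `ys` up to `y_{n-1}` (so the feature never sees the value being predicted), uses pandas' `ewm` for the average, and fits a plain `LinearRegression`. The helper name `add_ema_feature`, the span of 5 and the synthetic data are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def add_ema_feature(X, y, span=5):
    """Hypothetical helper: append the EMA of *past* y values as an extra column."""
    y = pd.Series(np.ravel(y), dtype=float)
    # shift(1) makes row n see only y_0 .. y_{n-1} (assumption: no look-ahead).
    ema_prev = y.ewm(span=span, adjust=False).mean().shift(1)
    X_ext = np.column_stack([np.asarray(X, dtype=float), ema_prev.to_numpy()])
    # The first row has no history, so it is dropped from both X and y.
    return X_ext[1:], y.to_numpy()[1:]

# Synthetic data, made up purely to exercise the code.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = np.zeros(200)
true_k = np.array([1.0, -2.0, 0.3])
for n in range(1, 200):
    y[n] = 0.5 + X[n] @ true_k + 0.6 * y[n - 1] + rng.normal(scale=0.1)

X_ext, y_trim = add_ema_feature(X, y, span=5)
model = LinearRegression().fit(X_ext, y_trim)
c = model.intercept_
k1, k2, k3, k4 = model.coef_
print(c, k1, k2, k3, k4)
```

To predict the next step, the same EMA has to be computed from the observed history and appended to the new factor row. Forecasting several steps ahead would mean feeding predictions back into the average, which this sketch does not attempt.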
Sklearn 'Predict' syntax: After you've initialized and trained the model, you can call the predict method using "dot" syntax. Inside the parentheses of the method, you provide the name of the new input data (i.e., the features of the test dataset). This dataset is commonly called `X_test`.
`predict()`: given a trained model, predict the label of a new set of data. This method accepts one argument, the new data `X_new` (e.g. `model.predict(X_new)`), and returns the predicted label for each object in the array.
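As a small illustration of that calling pattern, the sketch below fits a `LinearRegression` and then calls `predict` on new rows. The first three training rows reuse the numbers from the question; the fourth row and the rows in `X_new` are made up for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Training data: three rows from the question plus one invented row.
X_train = np.array([[1.0, 2.3, 4.5],
                    [6.7, 2.7, 1.2],
                    [3.2, 4.7, 1.1],
                    [0.5, 1.5, 2.5]])
y_train = np.array([1.0, 2.3, 7.7, 0.9])

model = LinearRegression()
model.fit(X_train, y_train)        # learn the coefficients

# New, unseen factor rows (invented values) with the same three columns.
X_new = np.array([[2.0, 3.0, 1.0],
                  [4.0, 1.0, 2.0]])
y_pred = model.predict(X_new)      # one prediction per row of X_new
print(y_pred)
```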
According to Wikipedia, EWMA works well with stationary data, but it does not work as expected in the presence of trends or seasonality. In those cases you should use a second- or third-order EWMA method, respectively. I decided to look at the pandas `ewma` function to see how it handled trends, and this is what I came up with:
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt

    def ewma(values, span):
        # pandas.stats.moments.ewma was removed from newer pandas releases;
        # Series.ewm(span=...).mean() is the replacement
        return pd.Series(values).ewm(span=span).mean().to_numpy()

    # make a hat function, and add noise
    x = np.linspace(0, 1, 100)
    x = np.hstack((x, x[::-1]))
    x += np.random.normal(loc=0, scale=0.1, size=200)
    plt.plot(x, alpha=0.4, label='Raw')

    # take EWMA in both directions with a smaller span term
    fwd = ewma(x, span=15)           # take EWMA in fwd direction
    bwd = ewma(x[::-1], span=15)     # take EWMA in bwd direction
    c = np.vstack((fwd, bwd[::-1]))  # lump fwd and bwd together
    c = np.mean(c, axis=0)           # average

    # regular EWMA, with bias against trend
    plt.plot(ewma(x, span=20), 'b', label='EWMA, span=20')
    # "corrected" (?) EWMA
    plt.plot(c, 'r', label='Reversed-Recombined')

    plt.legend(loc=8)
    plt.savefig('ewma_correction.png', format='png', dpi=100)
As you can see, the EWMA bucks the trend uphill and downhill. We can correct for this (without having to implement a second-order scheme ourselves) by taking the EWMA in both directions and then averaging. I hope your data was stationary!
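For comparison, the second-order scheme mentioned above is often implemented as Holt's double exponential smoothing, which tracks a trend term alongside the level. The following is a minimal NumPy sketch, not pandas functionality; the function name `holt_smooth`, the initialization from the first two points and the default `alpha` and `beta` values are assumptions chosen for illustration.

```python
import numpy as np

def holt_smooth(x, alpha=0.3, beta=0.1):
    """Hypothetical helper: Holt's linear (double) exponential smoothing.
    Returns the smoothed level for each point of x (needs at least 2 points)."""
    x = np.asarray(x, dtype=float)
    level = np.empty_like(x)
    trend = np.empty_like(x)
    # Assumed initialization: start at the first value with the first difference as trend.
    level[0] = x[0]
    trend[0] = x[1] - x[0]
    for t in range(1, len(x)):
        # Blend the new observation with the previous level projected along the trend.
        level[t] = alpha * x[t] + (1 - alpha) * (level[t - 1] + trend[t - 1])
        # Update the trend estimate from the change in level.
        trend[t] = beta * (level[t] - level[t - 1]) + (1 - beta) * trend[t - 1]
    return level

# On a noise-free ramp the smoothed level follows the data without lag.
ramp = np.arange(20) * 0.5
print(np.allclose(holt_smooth(ramp), ramp))
```

Unlike the forward-backward average, this variant only uses past values, so it can be applied to data as it arrives.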