
sklearn: User defined cross validation for time series data

I'm working on a machine learning problem with a dataset that has a time-series component, using scikit-learn (sklearn). The library ships many cross-validation iterators, plus several that let you define the splits yourself, but I don't know how to define a simple cross-validation scheme for time series. Here is an example of what I'm trying to get:

Suppose we have several periods (years) and we want to split our data set into several chunks as follows:

data = [1, 2, 3, 4, 5, 6, 7]

train: [1]                test: [2] (or test: [2, 3, 4, 5, 6, 7])
train: [1, 2]             test: [3] (or test: [3, 4, 5, 6, 7])
train: [1, 2, 3]          test: [4] (or test: [4, 5, 6, 7])
...
train: [1, 2, 3, 4, 5, 6] test: [7]

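I can't work out how to create this kind of cross-validation with sklearn tools. Perhaps I should use PredefinedSplit from sklearn.cross_validation (sklearn.model_selection in current releases), like this:

# In current scikit-learn releases PredefinedSplit lives in sklearn.model_selection.
from sklearn.model_selection import PredefinedSplit

train_fraction  = 0.8
train_size      = int(train_fraction * X_train.shape[0])
validation_size = X_train.shape[0] - train_size

# -1 keeps a sample in the training set; 1 assigns it to the single test fold.
cv_split = PredefinedSplit(test_fold=[-1] * train_size + [1] * validation_size)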

Result:

train: [1, 2, 3, 4, 5] test: [6, 7]

But this gives only a single split, so it is still not as good as the expanding-window split shown earlier.
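For reference, a minimal sketch of what that PredefinedSplit call yields on the seven-period example. It assumes a current scikit-learn release, where the class is imported from sklearn.model_selection; the `data` array is an illustrative stand-in for X_train.

```python
import numpy as np
from sklearn.model_selection import PredefinedSplit

# Illustrative stand-in for the seven periods in the question.
data = np.arange(1, 8)

train_size = int(0.8 * len(data))
# -1 means "always in training"; 1 is the label of the only test fold.
test_fold = [-1] * train_size + [1] * (len(data) - train_size)
cv_split = PredefinedSplit(test_fold=test_fold)

# Map the yielded index arrays back to the period values.
splits = [(data[tr].tolist(), data[te].tolist()) for tr, te in cv_split.split()]
for train, test in splits:
    print("train:", train, "test:", test)
```

Because every sample carries either -1 or the same fold label, the iterator produces one train/test pair rather than a growing sequence of them.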

Demyanov asked Nov 25 '15 23:11



2 Answers

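You can obtain the desired cross-validation splits without any of sklearn's splitter classes: build the (train, test) index pairs yourself and pass them as the cv argument. Here's an example: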

import numpy as np

from sklearn.svm import SVR
from sklearn.feature_selection import RFECV

# Generate some data.
N = 10
X_train = np.random.randn(N, 3)
y_train = np.random.randn(N)

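# Define the splits: train on the first i samples, test on all remaining ones.
# Use idxs[i:i + 1] as the test part to test on the next sample only.
idxs = np.arange(N)
cv_splits = [(idxs[:i], idxs[i:]) for i in range(1, N)]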

# Create the RFE object and compute a cross-validated score.
svr = SVR(kernel="linear")
rfecv = RFECV(estimator=svr, step=1, cv=cv_splits)
rfecv.fit(X_train, y_train)
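The same idea can be wrapped in a small helper. This is only a sketch: the function name `expanding_window_splits` and its `horizon` parameter are made up for illustration and are not part of sklearn. It yields the "train on everything so far, test on the next period" layout from the question.

```python
import numpy as np

def expanding_window_splits(n_samples, horizon=1):
    """Yield (train_idx, test_idx) pairs: the training window grows by one
    sample per split and the test window is the next `horizon` samples."""
    idxs = np.arange(n_samples)
    for i in range(1, n_samples - horizon + 1):
        yield idxs[:i], idxs[i:i + horizon]

# Materialise the generator so the splits can be reused more than once.
splits = list(expanding_window_splits(7))
for train, test in splits:
    # Shift by one to match the 1-based periods used in the question.
    print("train:", (train + 1).tolist(), "test:", (test + 1).tolist())
```

Passing the materialised list as cv, rather than the generator itself, is probably the safer choice, since a generator is exhausted after a single pass.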
Dan Oneață answered Oct 11 '22 08:10



This has since been added to scikit-learn as TimeSeriesSplit: http://scikit-learn.org/stable/modules/cross_validation.html#time-series-split

Example from the doc:

>>> import numpy as np
>>> from sklearn.model_selection import TimeSeriesSplit
>>> X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]])
>>> y = np.array([1, 2, 3, 4, 5, 6])
>>> tscv = TimeSeriesSplit(n_splits=3)
>>> print(tscv)  
TimeSeriesSplit(n_splits=3)
>>> for train, test in tscv.split(X):
...     print("%s %s" % (train, test))
[0 1 2] [3]
[0 1 2 3] [4]
[0 1 2 3 4] [5]
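To get exactly the layout from the question (seven periods, one test period per split), something like the following should work. It is a sketch that assumes scikit-learn 0.24 or later, where TimeSeriesSplit accepts a `test_size` argument; the `data` array is an illustrative stand-in for the real dataset.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Illustrative stand-in for the seven periods in the question.
data = np.arange(1, 8)

# One split per period after the first, each testing on a single period.
tscv = TimeSeriesSplit(n_splits=len(data) - 1, test_size=1)

# Map the yielded index arrays back to the period values.
splits = [(data[train].tolist(), data[test].tolist())
          for train, test in tscv.split(data)]
for train, test in splits:
    print("train:", train, "test:", test)
```

The `tscv` object can also be passed directly as the cv argument of estimators and helpers such as RFECV or cross_val_score, in place of a hand-built list of index pairs.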
Marcus V. answered Oct 11 '22 08:10
