 

scikit-learn cross validation custom splits for time series data

I'd like to use scikit-learn's GridSearchCV to determine some hyperparameters for a random forest model. My data is time-dependent and looks something like this:

import pandas as pd

train = pd.DataFrame({'date': pd.DatetimeIndex(['2012-1-1', '2012-9-30', '2013-4-3', '2014-8-16', '2015-3-20', '2015-6-30']),
                      'feature1': [1.2, 3.3, 2.7, 4.0, 8.2, 6.5],
                      'feature2': [4, 4, 10, 3, 10, 9],
                      'target': [1, 2, 1, 3, 2, 2]})

>>> train
        date  feature1  feature2  target
0 2012-01-01       1.2         4       1
1 2012-09-30       3.3         4       2
2 2013-04-03       2.7        10       1
3 2014-08-16       4.0         3       3
4 2015-03-20       8.2        10       2
5 2015-06-30       6.5         9       2

How can I implement the following cross validation folding technique?

train:(2012, 2013) - test:(2014)
train:(2013, 2014) - test:(2015)

That is, I want to use 2 years of historic observations to train a model and then test its accuracy in the subsequent year.

Ben asked Jun 02 '16


People also ask

What is the use of a cross-validator in time series?

TimeSeriesSplit is a time-series cross-validator: it provides train/test indices to split time-series data samples that are observed at fixed time intervals into train and test sets. In each split, the test indices must be higher than in the previous split, so shuffling in this cross-validator is inappropriate. This cross-validation object is a variation of KFold.
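For illustration, a minimal sketch of that behaviour on a toy array (the n_splits value is arbitrary): the training indices grow with each split and the test indices always come later.

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(6).reshape(-1, 1)  # six observations, assumed to be in time order

# Each successive split trains on a longer prefix and tests on the next block.
for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(X):
    print("train:", train_idx, "test:", test_idx)
# train: [0 1 2] test: [3]
# train: [0 1 2 3] test: [4]
# train: [0 1 2 3 4] test: [5]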

Why is cross-validation in scikit-learn so expensive?

GridSearchCV in scikit-learn explicitly tries every possible combination of parameters, which makes it computationally expensive. When cross-validation is used in the inner loop of the grid search, it is called grid search cross-validation; the optimization objective then becomes minimizing the average loss obtained over the k folds.
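As a rough sketch (the data and parameter values are arbitrary), the number of model fits is the number of grid combinations times the number of folds:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, random_state=0)

# 2 x 3 = 6 parameter combinations, each evaluated on 5 folds -> 30 fits in total.
param_grid = {'n_estimators': [50, 100], 'max_depth': [3, 5, None]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)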

What is TimeSeriesSplit in scikit-learn?

Scikit-learn offers TimeSeriesSplit for time-series validation. It splits the training data into multiple segments: we use the first segment to train the model with a set of hyperparameters and test it on the second.
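For example, a small sketch of that train-on-earlier, test-on-later pattern, fitting a model per split on synthetic data (the estimator and sizes are placeholders):

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import TimeSeriesSplit

X, y = make_regression(n_samples=100, n_features=4, random_state=0)  # rows assumed time-ordered

model = RandomForestRegressor(n_estimators=50, random_state=0)
for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(X):
    model.fit(X[train_idx], y[train_idx])          # train on the earlier segment
    score = model.score(X[test_idx], y[test_idx])  # evaluate on the later segment
    print(f"test rows {test_idx[0]}-{test_idx[-1]}: R^2 = {score:.2f}")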

What are the inputs for cross-validation splitting?

The cv argument determines the cross-validation splitting strategy. Possible inputs for cv include an iterable yielding (train, test) splits as arrays of indices. For int/None inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used; in all other cases, KFold is used.
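The iterable option is what the question needs: a hand-built list of (train, test) index arrays can encode the year-based folds directly. A minimal sketch, assuming the train DataFrame from the question is in scope (the parameter grid is arbitrary):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

years = train['date'].dt.year
custom_cv = [
    (np.where(years.isin([2012, 2013]))[0], np.where(years == 2014)[0]),  # train 2012-2013, test 2014
    (np.where(years.isin([2013, 2014]))[0], np.where(years == 2015)[0]),  # train 2013-2014, test 2015
]

search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid={'n_estimators': [10, 50]},
                      cv=custom_cv)
search.fit(train[['feature1', 'feature2']], train['target'])
print(search.best_params_)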


1 Answer

There is also TimeSeriesSplit in sklearn, which splits time-series data (i.e. data observed at fixed time intervals) into train/test sets. Note that, unlike standard cross-validation methods, successive training sets are supersets of those that come before them: in each split the test indices must be higher than before, so shuffling within the cross-validator is inappropriate.
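A minimal sketch of that, assuming the train DataFrame from the question is already sorted by date. Note that TimeSeriesSplit splits by row position rather than by calendar year, so the folds only approximate the year-based scheme asked for (the parameter grid is arbitrary):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

X = train[['feature1', 'feature2']]
y = train['target']

tscv = TimeSeriesSplit(n_splits=2)
for train_idx, test_idx in tscv.split(X):
    print("train years:", train['date'].dt.year.iloc[train_idx].tolist(),
          "test years:", train['date'].dt.year.iloc[test_idx].tolist())
# train years: [2012, 2012] test years: [2013, 2014]
# train years: [2012, 2012, 2013, 2014] test years: [2015, 2015]

search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid={'n_estimators': [10, 50]},
                      cv=tscv)
search.fit(X, y)
print(search.best_params_)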

mloning answered Sep 26 '22