I'd like to use scikit-learn's GridSearchCV to determine some hyperparameters for a random forest model. My data is time-dependent and looks something like
import pandas as pd

train = pd.DataFrame({
    'date': pd.DatetimeIndex(['2012-1-1', '2012-9-30', '2013-4-3',
                              '2014-8-16', '2015-3-20', '2015-6-30']),
    'feature1': [1.2, 3.3, 2.7, 4.0, 8.2, 6.5],
    'feature2': [4, 4, 10, 3, 10, 9],
    'target': [1, 2, 1, 3, 2, 2]})
>>> train
date feature1 feature2 target
0 2012-01-01 1.2 4 1
1 2012-09-30 3.3 4 2
2 2013-04-03 2.7 10 1
3 2014-08-16 4.0 3 3
4 2015-03-20 8.2 10 2
5 2015-06-30 6.5 9 2
How can I implement the following cross-validation folding technique?
train:(2012, 2013) - test:(2014)
train:(2013, 2014) - test:(2015)
That is, I want to use 2 years of historic observations to train a model and then test its accuracy in the subsequent year.
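A minimal sketch of one way to build exactly these folds: derive each row's year and construct the (train, test) index pairs by hand. The year_splits and test_year names are illustrative, not a scikit-learn API; GridSearchCV can consume such a list directly, as shown further below.

import numpy as np

years = train['date'].dt.year
year_splits = []
for test_year in [2014, 2015]:
    # train on the two preceding years, test on test_year
    train_idx = np.where(years.isin([test_year - 2, test_year - 1]))[0]
    test_idx = np.where(years == test_year)[0]
    year_splits.append((train_idx, test_idx))

for tr, te in year_splits:
    print('train years:', sorted(years.iloc[tr].unique()),
          '- test years:', sorted(years.iloc[te].unique()))
# train years: [2012, 2013] - test years: [2014]
# train years: [2013, 2014] - test years: [2015]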
Alternatively, scikit-learn ships a built-in time-series cross-validator, TimeSeriesSplit. It provides train/test indices to split time-series samples that are observed at fixed time intervals. In each split, the test indices must be higher than in earlier splits, so shuffling in the cross-validator is inappropriate. This cross-validation object is a variation of KFold.
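As a short sketch on the example frame above (assuming the rows are already sorted by date, as they are here):

from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=2)
for train_idx, test_idx in tscv.split(train[['feature1', 'feature2']]):
    print('train indices:', train_idx, '- test indices:', test_idx)
# train indices: [0 1] - test indices: [2 3]
# train indices: [0 1 2 3] - test indices: [4 5]

Note that TimeSeriesSplit splits by row position, not by calendar year, so it only approximates the year-based folds asked for above; for exact year boundaries you still need hand-built splits like the ones sketched earlier.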
Bear in mind that GridSearchCV explicitly tries every combination in the parameter grid, which makes it computationally expensive. When cross-validation is used in the inner loop of the grid search, this is called grid-search cross-validation; the optimization objective then becomes minimizing the average loss obtained on the k folds.
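Putting the two together, here is a minimal sketch of grid-searching a random forest with TimeSeriesSplit as the inner cross-validator (the parameter grid is illustrative, not a recommendation):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

X = train[['feature1', 'feature2']]
y = train['target']
# illustrative grid; a real search would likely be larger
param_grid = {'n_estimators': [10, 50], 'max_depth': [2, 4]}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid,
                      cv=TimeSeriesSplit(n_splits=2))
search.fit(X, y)
print(search.best_params_)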
To recap, scikit-learn offers TimeSeriesSplit for time-series validation. It splits the training data into consecutive segments; the model is trained with a given set of hyperparameters on the earlier segments and tested on the one that follows.
The cv argument of GridSearchCV determines the cross-validation splitting strategy. Possible inputs for cv include an iterable yielding (train, test) splits as arrays of indices. For int/None inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used; in all other cases, KFold is used.
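That iterable option is the hook that answers the original question: the year-based year_splits list sketched earlier can be passed directly as cv (reusing X, y and param_grid from the sketch above):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid,
                      cv=year_splits)   # explicit (train, test) index pairs
search.fit(X, y)
print(search.best_params_)   # best hyperparameters on the year-based folds

Because an explicit iterable is supplied, neither StratifiedKFold nor KFold kicks in: the model is always trained on two years of data and tested on the subsequent year.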
Finally, note that with TimeSeriesSplit, unlike standard cross-validation methods, successive training sets are supersets of those that come before them: in each split the test indices are higher than in the previous one, which is why shuffling is never applied.