I work with panel data: I observe a number of units (e.g. people) over time; for each unit, I have records for the same fixed time intervals.
When splitting the data into train and test sets, we need to make sure that both sets are disjoint and sequential, i.e. the latest records in the train set should be before the earliest records in the test set (see e.g. this blog post).
Is there any standard Python implementation of cross-validation for panel data?
I've tried Scikit-Learn's TimeSeriesSplit, which cannot account for groups, and GroupShuffleSplit, which cannot account for the sequential nature of the data; see the code below.
import pandas as pd
import numpy as np
from sklearn.model_selection import GroupShuffleSplit, TimeSeriesSplit

# generate panel data: 10 users, 12 monthly records each
user = np.repeat(np.arange(10), 12)
time = np.tile(pd.date_range(start='2018-01-01', periods=12, freq='M'), 10)
data = (pd.DataFrame({'user': user, 'time': time})
        .sort_values(['time', 'user'])
        .reset_index(drop=True))

tscv = TimeSeriesSplit(n_splits=4)
for train_idx, test_idx in tscv.split(data):
    train = data.iloc[train_idx]
    test = data.iloc[test_idx]
    train_end = train.time.max().date()
    test_start = test.time.min().date()
    print('TRAIN:', train_end, '\tTEST:', test_start,
          '\tSequential:', train_end < test_start, sep=' ')
Output:
TRAIN: 2018-03-31 TEST: 2018-03-31 Sequential: False
TRAIN: 2018-05-31 TEST: 2018-05-31 Sequential: False
TRAIN: 2018-08-31 TEST: 2018-08-31 Sequential: False
TRAIN: 2018-10-31 TEST: 2018-10-31 Sequential: False
So, in this example, I want the train and test sets to still be sequential.
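For illustration, a single manual split with the property I'm after could look like this (the cutoff month is arbitrary; this is a sketch, not the cross-validator I'm looking for):

# manual split on the sorted unique time values, so that every training
# record strictly precedes every test record (cutoff chosen arbitrarily)
unique_times = np.sort(data['time'].unique())
cutoff = unique_times[8]
train = data[data['time'] <= cutoff]
test = data[data['time'] > cutoff]
assert train['time'].max() < test['time'].min()  # sequential
assert not set(train.index) & set(test.index)    # disjoint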
There are a number of related, older posts, but none with a (convincing) answer; see e.g.
https://stackoverflow.com/questions/51861417/time-series-prediction-for-grouped-data [now deleted]
Stratified Cross validation of timeseries data
The method that can be used for cross-validating a time-series model is cross-validation on a rolling basis: start with a small subset of the data for training, forecast the later data points, and then check the accuracy of the forecasted data points.
So, rather than using k-fold cross-validation, for time-series data we use hold-out cross-validation, where a subset of the data (split temporally) is reserved for validating the model performance.
A more sophisticated version of training/test sets is time-series cross-validation. In this procedure, there is a series of test sets, each consisting of a single observation. The corresponding training set consists only of observations that occurred prior to the observation that forms the test set.
Forward-chaining cross-validation, also called rolling-origin cross-validation, is similar to k-fold but suited to sequential data such as time series: there is no random shuffling of the data to begin with, though a final test set may still be set aside.
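For concreteness, a minimal sketch of such forward-chaining splits on a plain index sequence (the helper name and parameters below are illustrative, not a library API):

import numpy as np

# illustrative helper: each test window of size `horizon` immediately
# follows a training window that grows by one observation per split
def rolling_origin_splits(n_samples, horizon=1, min_train=1):
    for train_end in range(min_train, n_samples - horizon + 1):
        yield np.arange(train_end), np.arange(train_end, train_end + horizon)

for train_idx, test_idx in rolling_origin_splits(6, horizon=2, min_train=2):
    print("TRAIN:", train_idx, "TEST:", test_idx)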
This feature was requested on scikit-learn, and I have added a PR for it. The code is awaiting review at this point. It was used with some good results in a recent Kaggle competition. Note that it does not support a gap parameter between different groups; a feature request for the same has been raised on scikit-learn.

import numpy as np
from sklearn.model_selection._split import _BaseKFold, indexable, _num_samples
from sklearn.utils.validation import _deprecate_positional_args
# https://github.com/getgaurav2/scikit-learn/blob/d4a3af5cc9da3a76f0266932644b884c99724c57/sklearn/model_selection/_split.py#L2243
class GroupTimeSeriesSplit(_BaseKFold):
    """Time Series cross-validator variant with non-overlapping groups.

    Provides train/test indices to split time series data samples
    that are observed at fixed time intervals according to a
    third-party provided group.
    In each split, test indices must be higher than before, and thus
    shuffling in cross validator is inappropriate.
    This cross-validation object is a variation of :class:`KFold`.
    In the kth split, it returns first k folds as train set and the
    (k+1)th fold as test set.
    The same group will not appear in two different folds (the number of
    distinct groups has to be at least equal to the number of folds).
    Note that unlike standard cross-validation methods, successive
    training sets are supersets of those that come before them.
    Read more in the :ref:`User Guide <cross_validation>`.

    Parameters
    ----------
    n_splits : int, default=5
        Number of splits. Must be at least 2.
    max_train_size : int, default=None
        Maximum size for a single training set.

    Examples
    --------
    >>> import numpy as np
    >>> from sklearn.model_selection import GroupTimeSeriesSplit
    >>> groups = np.array(['a', 'a', 'a', 'a', 'a', 'a',
    ...                    'b', 'b', 'b', 'b', 'b',
    ...                    'c', 'c', 'c', 'c',
    ...                    'd', 'd', 'd'])
    >>> gtss = GroupTimeSeriesSplit(n_splits=3)
    >>> for train_idx, test_idx in gtss.split(groups, groups=groups):
    ...     print("TRAIN:", train_idx, "TEST:", test_idx)
    ...     print("TRAIN GROUP:", groups[train_idx],
    ...           "TEST GROUP:", groups[test_idx])
    TRAIN: [0, 1, 2, 3, 4, 5] TEST: [6, 7, 8, 9, 10]
    TRAIN GROUP: ['a' 'a' 'a' 'a' 'a' 'a'] TEST GROUP: ['b' 'b' 'b' 'b' 'b']
    TRAIN: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10] TEST: [11, 12, 13, 14]
    TRAIN GROUP: ['a' 'a' 'a' 'a' 'a' 'a' 'b' 'b' 'b' 'b' 'b'] TEST GROUP: ['c' 'c' 'c' 'c']
    TRAIN: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14] TEST: [15, 16, 17]
    TRAIN GROUP: ['a' 'a' 'a' 'a' 'a' 'a' 'b' 'b' 'b' 'b' 'b' 'c' 'c' 'c' 'c'] TEST GROUP: ['d' 'd' 'd']
    """
    @_deprecate_positional_args
    def __init__(self, n_splits=5, *, max_train_size=None):
        super().__init__(n_splits, shuffle=False, random_state=None)
        self.max_train_size = max_train_size

    def split(self, X, y=None, groups=None):
        """Generate indices to split data into training and test set.

        Parameters
        ----------
        X : array-like of shape (n_samples, n_features)
            Training data, where n_samples is the number of samples
            and n_features is the number of features.
        y : array-like of shape (n_samples,)
            Always ignored, exists for compatibility.
        groups : array-like of shape (n_samples,)
            Group labels for the samples used while splitting the dataset
            into train/test set.

        Yields
        ------
        train : ndarray
            The training set indices for that split.
        test : ndarray
            The testing set indices for that split.
        """
        if groups is None:
            raise ValueError("The 'groups' parameter should not be None")
        X, y, groups = indexable(X, y, groups)
        n_samples = _num_samples(X)
        n_splits = self.n_splits
        n_folds = n_splits + 1
        # map each group label to the row indices of its samples,
        # preserving the order in which groups first appear
        group_dict = {}
        u, ind = np.unique(groups, return_index=True)
        unique_groups = u[np.argsort(ind)]
        n_groups = _num_samples(unique_groups)
        for idx in np.arange(n_samples):
            if groups[idx] in group_dict:
                group_dict[groups[idx]].append(idx)
            else:
                group_dict[groups[idx]] = [idx]
        if n_folds > n_groups:
            raise ValueError(
                ("Cannot have number of folds={0} greater than"
                 " the number of groups={1}").format(n_folds, n_groups))
        group_test_size = n_groups // n_folds
        group_test_starts = range(n_groups - n_splits * group_test_size,
                                  n_groups, group_test_size)
        for group_test_start in group_test_starts:
            train_array = []
            test_array = []
            for train_group_idx in unique_groups[:group_test_start]:
                train_array_tmp = group_dict[train_group_idx]
                train_array = np.sort(np.unique(
                    np.concatenate((train_array, train_array_tmp)),
                    axis=None), axis=None)
            train_end = train_array.size
            if self.max_train_size and self.max_train_size < train_end:
                train_array = train_array[train_end - self.max_train_size:
                                          train_end]
            for test_group_idx in unique_groups[group_test_start:
                                                group_test_start +
                                                group_test_size]:
                test_array_tmp = group_dict[test_group_idx]
                test_array = np.sort(np.unique(
                    np.concatenate((test_array, test_array_tmp)),
                    axis=None), axis=None)
            yield [int(i) for i in train_array], [int(i) for i in test_array]
Example with GridSearchCV. Code modified from an SO post here.

import numpy as np
import xgboost as xgb
from sklearn.model_selection import GridSearchCV

groups = np.array(['a', 'a', 'a', 'b', 'b', 'c'])
X = np.array([[4, 5, 6, 1, 0, 2], [3.1, 3.5, 1.0, 2.1, 8.3, 1.1]]).T
y = np.array([1, 6, 7, 1, 2, 3])

model = xgb.XGBRegressor()
param_search = {'max_depth': [3, 5]}
tscv = GroupTimeSeriesSplit(n_splits=2)
gsearch = GridSearchCV(estimator=model, cv=tscv, param_grid=param_search)
gsearch.fit(X, y, groups=groups)
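As a quick sanity check on the question's panel data (a sketch, assuming the data frame from the question is in scope), we can use the time stamps themselves as the group labels:

# use each time stamp as a group so that whole months enter the
# train or test side together
gtss = GroupTimeSeriesSplit(n_splits=4)
for train_idx, test_idx in gtss.split(data, groups=data['time']):
    train_end = data.loc[train_idx, 'time'].max().date()
    test_start = data.loc[test_idx, 'time'].min().date()
    print('TRAIN:', train_end, '\tTEST:', test_start,
          '\tSequential:', train_end < test_start)

Every split should now report Sequential: True.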
I recently hit the same task and, after failing to find an appropriate solution, decided to write my own class, which is a tweaked version of scikit-learn's TimeSeriesSplit implementation. Therefore, I'll leave it here for whoever comes later looking for a solution.
The idea is basically to sort the data by time, group the observations according to the time variable, and then build the cross-validator the same way TimeSeriesSplit does, but on the newly formed groups of observations.
import numpy as np
from sklearn.utils import indexable
from sklearn.utils.validation import _num_samples
from sklearn.model_selection._split import _BaseKFold

class GroupTimeSeriesSplit(_BaseKFold):
    """
    Time Series cross-validator for a variable number of observations within the
    time unit. In the kth split, it returns first k folds as train set and the
    (k+1)th fold as test set. Indices can be grouped so that they enter the CV
    fold together.

    Parameters
    ----------
    n_splits : int, default=5
        Number of splits. Must be at least 2.
    max_train_size : int, default=None
        Maximum size for a single training set.
    """
    def __init__(self, n_splits=5, *, max_train_size=None):
        super().__init__(n_splits, shuffle=False, random_state=None)
        self.max_train_size = max_train_size

    def split(self, X, y=None, groups=None):
        """
        Generate indices to split data into training and test set.

        Parameters
        ----------
        X : array-like of shape (n_samples, n_features)
            Training data, where n_samples is the number of samples and
            n_features is the number of features.
        y : array-like of shape (n_samples,)
            Always ignored, exists for compatibility.
        groups : array-like of shape (n_samples,)
            Group labels for the samples used while splitting the dataset into
            train/test set. Most often just a time feature.

        Yields
        ------
        train : ndarray
            The training set indices for that split.
        test : ndarray
            The testing set indices for that split.
        """
        n_splits = self.n_splits
        X, y, groups = indexable(X, y, groups)
        n_samples = _num_samples(X)
        n_folds = n_splits + 1
        indices = np.arange(n_samples)
        # count samples per (sorted) group value, then split the row indices
        # into one contiguous block per group
        group_counts = np.unique(groups, return_counts=True)[1]
        groups = np.split(indices, np.cumsum(group_counts)[:-1])
        n_groups = _num_samples(groups)
        if n_folds > n_groups:
            raise ValueError(
                ("Cannot have number of folds={0} greater than"
                 " the number of groups={1}.").format(n_folds, n_groups))
        test_size = n_groups // n_folds
        test_starts = range(test_size + n_groups % n_folds,
                            n_groups, test_size)
        for test_start in test_starts:
            if self.max_train_size:
                # walk back from the test fold until the training window
                # would exceed max_train_size samples
                train_start = np.searchsorted(
                    np.cumsum(
                        group_counts[:test_start][::-1])[::-1] < self.max_train_size + 1,
                    True)
                yield (np.concatenate(groups[train_start:test_start]),
                       np.concatenate(groups[test_start:test_start + test_size]))
            else:
                yield (np.concatenate(groups[:test_start]),
                       np.concatenate(groups[test_start:test_start + test_size]))
And applying it to the OP's example, we get:

gtscv = GroupTimeSeriesSplit(n_splits=3)
for split_id, (train_id, val_id) in enumerate(gtscv.split(data, groups=data["time"])):
    print("Split id: ", split_id, "\n")
    print("Train id: ", train_id, "\n", "Validation id: ", val_id)
    print("Train dates: ", data.loc[train_id, "time"].unique(), "\n",
          "Validation dates: ", data.loc[val_id, "time"].unique(), "\n")
Split id: 0
Train id: [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
24 25 26 27 28 29]
Validation id: [30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53
54 55 56 57 58 59]
Train dates: ['2018-01-31T00:00:00.000000000' '2018-02-28T00:00:00.000000000'
'2018-03-31T00:00:00.000000000']
Validation dates: ['2018-04-30T00:00:00.000000000' '2018-05-31T00:00:00.000000000'
'2018-06-30T00:00:00.000000000']
Split id: 1
Train id: [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
48 49 50 51 52 53 54 55 56 57 58 59]
Validation id: [60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83
84 85 86 87 88 89]
Train dates: ['2018-01-31T00:00:00.000000000' '2018-02-28T00:00:00.000000000'
'2018-03-31T00:00:00.000000000' '2018-04-30T00:00:00.000000000'
'2018-05-31T00:00:00.000000000' '2018-06-30T00:00:00.000000000']
Validation dates: ['2018-07-31T00:00:00.000000000' '2018-08-31T00:00:00.000000000'
'2018-09-30T00:00:00.000000000']
Split id: 2
Train id: [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89]
Validation id: [ 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107
108 109 110 111 112 113 114 115 116 117 118 119]
Train dates: ['2018-01-31T00:00:00.000000000' '2018-02-28T00:00:00.000000000'
'2018-03-31T00:00:00.000000000' '2018-04-30T00:00:00.000000000'
'2018-05-31T00:00:00.000000000' '2018-06-30T00:00:00.000000000'
'2018-07-31T00:00:00.000000000' '2018-08-31T00:00:00.000000000'
'2018-09-30T00:00:00.000000000']
Validation dates: ['2018-10-31T00:00:00.000000000' '2018-11-30T00:00:00.000000000'
'2018-12-31T00:00:00.000000000']