
Explicitly specifying test/train sets in GridSearchCV

I have a question about the cv parameter of sklearn's GridSearchCV.

I'm working with data that has a time component, so random shuffling within KFold cross-validation doesn't seem sensible.

Instead, I want to explicitly specify cutoffs for training, validation, and test data within a GridSearchCV. Can I do this?

To illustrate the question, here's how I would do that manually.

import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
np.random.seed(444)

index = pd.date_range('2014', periods=60, freq='M')
X, y = make_regression(n_samples=60, n_features=3, random_state=444, noise=90.)
X = pd.DataFrame(X, index=index, columns=list('abc'))
y = pd.Series(y, index=index, name='y')

# Train on the first 35 samples, validate on the next 15, test on
#     the final 10.
X_train, X_val, X_test = np.array_split(X, [35, 50])
y_train, y_val, y_test = np.array_split(y, [35, 50])

param_grid = {'alpha': np.linspace(0, 1, 11)}
model = None
best_param_ = None
best_score_ = -np.inf

# Manual implementation
for alpha in param_grid['alpha']:
    ridge = Ridge(random_state=444, alpha=alpha).fit(X_train, y_train)
    score = ridge.score(X_val, y_val)
    if score > best_score_:
        best_score_ = score
        best_param_ = alpha
        model = ridge

print('Optimal alpha parameter: {:0.2f}'.format(best_param_))
print('Best score (on validation data): {:0.2f}'.format(best_score_))
print('Test set score: {:.2f}'.format(model.score(X_test, y_test)))
# Optimal alpha parameter: 1.00
# Best score (on validation data): 0.64
# Test set score: 0.22

The process here is:

  • For both X and y, I want a training set, validation set, and testing set. The training set is the first 35 samples in the time series. The validation set is the next 15 samples. The test set is the final 10.
  • The train and validation sets are used to determine the optimal alpha parameter within Ridge regression. Here I test alphas of (0.0, 0.1, ..., 0.9, 1.0).
  • The test set is held out for the "actual" testing as unseen data.

Anyways ... It seems like I'm looking to do something like this, but am not sure what to pass to cv here:

from sklearn.model_selection import GridSearchCV
grid_search = GridSearchCV(Ridge(random_state=444), param_grid, cv= ???)
grid_search.fit(...?)

The docs, which I'm having trouble interpreting, specify:

cv : int, cross-validation generator or an iterable, optional

Determines the cross-validation splitting strategy. Possible inputs for cv are:

  • None, to use the default 3-fold cross validation,
  • integer, to specify the number of folds in a (Stratified)KFold,
  • An object to be used as a cross-validation generator.
  • An iterable yielding train, test splits.

For integer/None inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used. In all other cases, KFold is used.

Brad Solomon asked Jan 22 '18 21:01



3 Answers

As @MaxU said, it's better to let GridSearchCV handle the splits, but if you want to enforce the split exactly as you set it up in the question, you can use PredefinedSplit, which does exactly that.

So you need to make the following changes to your code.

# Here X_test, y_test is the untouched data
# Validation data (X_val, y_val) is currently inside X_train, which will be split using PredefinedSplit inside GridSearchCV
X_train, X_test = np.array_split(X, [50])
y_train, y_test = np.array_split(y, [50])


# The indices which have the value -1 will be kept in train.
train_indices = np.full((35,), -1, dtype=int)

# The indices which have zero or positive values, will be kept in test
test_indices = np.full((15,), 0, dtype=int)
test_fold = np.append(train_indices, test_indices)

print(test_fold)
# OUTPUT (array repr):
# array([-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
#        -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
#        -1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0])

from sklearn.model_selection import PredefinedSplit
ps = PredefinedSplit(test_fold)

# Check how many splits will be done, based on test_fold
ps.get_n_splits()
# OUTPUT: 1

for train_index, test_index in ps.split():
    print("TRAIN:", train_index, "TEST:", test_index)

# OUTPUT (as printed under Python 2):
# ('TRAIN:', array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
#    17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
#    34]),
#  'TEST:', array([35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49]))


# And now, send this `ps` to cv param in GridSearchCV
from sklearn.model_selection import GridSearchCV
grid_search = GridSearchCV(Ridge(random_state=444), param_grid, cv=ps)

# Here, send the X_train and y_train
grid_search.fit(X_train, y_train)

The X_train, y_train passed to fit() will be split into train and test (validation, in your case) using the split we defined, so the Ridge will be trained on the original data at indices [0:35] and scored on [35:50].
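
One detail worth keeping in mind: with the default refit=True, GridSearchCV retrains the best estimator on everything passed to fit() (all 50 rows here), so its test score can differ from the manual loop in the question, which keeps the model fit on the first 35 rows only. The following is a minimal sketch, assuming the same synthetic data as in the question (rebuilt as plain NumPy arrays so it is self-contained). The names X_dev, explicit_cv, search_explicit and final_model are illustrative only. It also shows that a list holding one (train, validation) index pair can be passed as cv in place of PredefinedSplit.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, PredefinedSplit

# Same synthetic data as in the question, kept as plain NumPy arrays
X, y = make_regression(n_samples=60, n_features=3, random_state=444, noise=90.)
X_dev, X_test = X[:50], X[50:]   # rows 0-49: train + validation, rows 50-59: held out
y_dev, y_test = y[:50], y[50:]

param_grid = {'alpha': np.linspace(0, 1, 11)}

# -1 marks rows that always stay in training; 0 marks the single validation fold
test_fold = np.concatenate([np.full(35, -1), np.zeros(15, dtype=int)])
ps = PredefinedSplit(test_fold)

# refit=False: do not retrain the winning model on all 50 rows
search = GridSearchCV(Ridge(), param_grid, cv=ps, refit=False)
search.fit(X_dev, y_dev)

# The same split written as an iterable of (train, validation) index arrays
explicit_cv = [(np.arange(35), np.arange(35, 50))]
search_explicit = GridSearchCV(Ridge(), param_grid, cv=explicit_cv, refit=False)
search_explicit.fit(X_dev, y_dev)

# Fit the final model on the training rows only, as in the manual loop
best_alpha = search.best_params_['alpha']
final_model = Ridge(alpha=best_alpha).fit(X_dev[:35], y_dev[:35])

print('Best alpha:', best_alpha)
print('Validation score:', search.best_score_)
print('Test score:', final_model.score(X_test, y_test))
```

Leaving refit at its default would instead expose search.best_estimator_, trained on all 50 rows, which is often what you want before scoring on the held-out test set.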

Hope this clarifies how it works.

Vivek Kumar answered Nov 07 '22 19:11



Have you tried TimeSeriesSplit?

It was made explicitly for splitting time series data.

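from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
tscv = TimeSeriesSplit(n_splits=3)
# Pass the splitter itself; GridSearchCV calls tscv.split(X) internally
grid_search = GridSearchCV(clf, param_grid, cv=tscv)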
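
To see which rows TimeSeriesSplit actually assigns to each fold, here is a small sketch on 60 ordered samples mirroring the question's data. The Ridge estimator and the alpha grid are assumptions carried over from the question, since clf and param_grid are not defined in this answer.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

X, y = make_regression(n_samples=60, n_features=3, random_state=444, noise=90.)
tscv = TimeSeriesSplit(n_splits=3)

# Record the first and last row of each train and validation block
bounds = [(int(tr[0]), int(tr[-1]), int(va[0]), int(va[-1]))
          for tr, va in tscv.split(X)]
for tr_start, tr_end, va_start, va_end in bounds:
    print(f'train rows {tr_start}-{tr_end}, validate rows {va_start}-{va_end}')
# train rows 0-14, validate rows 15-29
# train rows 0-29, validate rows 30-44
# train rows 0-44, validate rows 45-59

param_grid = {'alpha': np.linspace(0, 1, 11)}   # assumed grid, from the question
grid_search = GridSearchCV(Ridge(), param_grid, cv=tscv)
grid_search.fit(X, y)
print(grid_search.best_params_)
```

Each fold trains on an expanding window and validates on the block that follows it, so no fold is scored on data older than what it was trained on. Unlike PredefinedSplit, the cutoffs are derived from n_splits rather than chosen by hand.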
Bert Kellerman answered Nov 07 '22 18:11



In time series data, KFold is not the right approach: it ignores temporal order, so some folds train on later observations and validate on earlier ones (and with shuffle=True the ordering within the series is lost entirely). Here is an approach:

import xgboost as xgb
from sklearn.model_selection import TimeSeriesSplit, GridSearchCV
import numpy as np
X = np.array([[4, 5, 6, 1, 0, 2], [3.1, 3.5, 1.0, 2.1, 8.3, 1.1]]).T
y = np.array([1, 6, 7, 1, 2, 3])
tscv = TimeSeriesSplit(n_splits=2)

model = xgb.XGBRegressor()
param_search = {'max_depth' : [3, 5]}

# Pass the splitter itself; GridSearchCV calls tscv.split(X) internally
gsearch = GridSearchCV(estimator=model, cv=tscv,
                        param_grid=param_search)
gsearch.fit(X, y)

reference - How do I use a TimeSeriesSplit with a GridSearchCV object to tune a model in scikit-learn?
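
A note on why passing the splitter object as cv tends to be safer than passing the result of split(X): the latter is a generator, which can be iterated only once, so reusing it (for example in a second fit) would yield no splits. A minimal sketch on toy data (the array below is made up for illustration):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(6, 2)   # 6 ordered samples, 2 features (toy data)
tscv = TimeSeriesSplit(n_splits=2)

gen = tscv.split(X)
first_pass = list(gen)    # consumes the generator
second_pass = list(gen)   # nothing left to yield
print(len(first_pass), len(second_pass))   # 2 0

# The splitter object can produce fresh splits as often as needed
print(len(list(tscv.split(X))), len(list(tscv.split(X))))   # 2 2
```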

rohan chikorde answered Nov 07 '22 17:11
