 

Scikit-learn, GroupKFold with shuffling groups?

I was using StratifiedKFold from scikit-learn, but now I also need to take "groups" into account. There is the nice function GroupKFold, but my data are very time dependent. Similarly to the example in the help, the week number is the grouping index, and each week should appear in only one fold.

Suppose I need 10 folds. What I need is to shuffle the data first, before I can use GroupKFold.

The shuffling is in a group sense - whole groups should be shuffled among each other.

Is there an elegant way to do this with scikit-learn? It seems to me that GroupKFold is insensitive to shuffling the data first.

If there is no way to do it with scikit-learn, can anyone write some efficient code for this? I have large data sets.

matrix, label, groups as inputs
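
Roughly, what I mean by shuffling in the group sense is something like this sketch (assuming NumPy arrays matrix, label, groups):

import numpy as np

# Permute the distinct group labels, then reorder the rows so that
# whole groups move together (rows within a group stay adjacent).
rng = np.random.default_rng(0)
shuffled = rng.permutation(np.unique(groups))
rank = {g: i for i, g in enumerate(shuffled)}
order = np.argsort([rank[g] for g in groups], kind="stable")
matrix, label, groups = matrix[order], label[order], groups[order]

But reordering the rows like this does not change which weeks GroupKFold puts together in a fold, which is exactly my problem.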

asked Nov 26 '16 by gugatr0n1c

People also ask

Does Sklearn cross-validation shuffle?

By default no shuffling occurs, including for the (stratified) K-fold cross-validation performed by specifying cv=some_integer to cross_val_score, grid search, etc. Keep in mind that train_test_split still returns a random split.
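
For instance, a minimal sketch of opting into shuffling explicitly (toy data and estimator chosen just for illustration):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=100, random_state=0)

# cv=5 would use (Stratified)KFold without shuffling;
# pass a shuffled splitter instead to randomize fold membership.
scores = cross_val_score(LogisticRegression(), X, y,
                         cv=KFold(n_splits=5, shuffle=True, random_state=0))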

What is group shuffle split?

Shuffle-Group(s)-Out cross-validation iterator. Provides randomized train/test indices to split data according to a third-party provided group. This group information can be used to encode arbitrary domain specific stratifications of the samples as integers.
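
A minimal usage sketch with toy arrays:

import numpy as np
from sklearn.model_selection import GroupShuffleSplit

X = np.arange(8).reshape(8, 1)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1])
groups = np.array([1, 1, 2, 2, 3, 3, 4, 4])

# Each split holds out whole groups at random; a group is never split
# across train and test within the same split.
gss = GroupShuffleSplit(n_splits=3, test_size=0.25, random_state=0)
for train_idx, test_idx in gss.split(X, y, groups):
    print(np.unique(groups[test_idx]))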

What is shuffle in KFold?

KFold cross-validation with shuffle: in plain k-fold cross-validation, the dataset is divided into k folds in order. When shuffle and random_state are set on KFold, the samples are assigned to folds at random: kfs = KFold(n_splits=5, shuffle=True, random_state=2021)

What is shuffle split cross-validation?

Shuffle Split method: repeated random subsampling validation, also referred to as Monte Carlo cross-validation, splits the dataset randomly into training and validation sets. Unlike k-fold cross-validation, the dataset is not partitioned into fixed folds; each split is drawn at random, so samples may appear in the validation set of several splits.
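
A minimal sketch with toy data:

import numpy as np
from sklearn.model_selection import ShuffleSplit

X = np.arange(10).reshape(10, 1)

# Each of the 5 splits draws a fresh random 30% validation subset;
# unlike KFold, the test sets may overlap across splits.
ss = ShuffleSplit(n_splits=5, test_size=0.3, random_state=0)
for train_idx, test_idx in ss.split(X):
    print(test_idx)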


1 Answer

The same group will not appear in two different folds (the number of distinct groups has to be at least equal to the number of folds)

In GroupKFold the groups array has the same length as the data, i.e. one group label per sample.

For data in X, y and groups:

import numpy as np
from sklearn.model_selection import GroupKFold
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

X = np.array([[1, 2, 1, 1], [3, 4, 7, 8], [5, 6, 1, 3], [7, 8, 4, 7]])
y = np.array([0, 2, 1, 2])
groups = np.array([2, 1, 0, 1])

# One fold per distinct group (3 distinct groups here), so every group
# appears in exactly one test fold.
group_kfold = GroupKFold(n_splits=len(np.unique(groups)))
group_kfold.get_n_splits(X, y, groups)

param_grid = {
    'min_child_weight': [50, 100],
    'subsample': [0.1, 0.2],
    'colsample_bytree': [0.1, 0.2],
    'max_depth': [2, 3],
    'learning_rate': [0.01],
    'n_estimators': [100, 500],
    'reg_lambda': [0.1, 0.2]
}

xgb = XGBClassifier()

# Pass the group-aware splits to GridSearchCV via the cv argument.
grid_search = GridSearchCV(xgb, param_grid, cv=group_kfold.split(X, y, groups), n_jobs=-1)

result = grid_search.fit(X, y)
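
GroupKFold assigns groups to folds deterministically, so if you also want the randomized group-to-fold assignment the question asks about, one possible sketch (reusing groups, xgb and param_grid from above) is to shuffle the distinct groups, deal them into folds yourself, and pass the resulting index pairs as cv:

rng = np.random.default_rng(0)
shuffled_groups = rng.permutation(np.unique(groups))
n_splits = len(shuffled_groups)

# Deal the shuffled groups round-robin into folds, then build
# (train_indices, test_indices) pairs that GridSearchCV accepts as cv.
fold_of = {g: i % n_splits for i, g in enumerate(shuffled_groups)}
fold_id = np.array([fold_of[g] for g in groups])
cv_splits = [(np.where(fold_id != k)[0], np.where(fold_id == k)[0])
             for k in range(n_splits)]

grid_search = GridSearchCV(xgb, param_grid, cv=cv_splits, n_jobs=-1)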
answered Sep 17 '22 by Mukul Gupta