I was using StratifiedKFold from scikit-learn, but now I also need to account for "groups". There is a nice GroupKFold class, but my data are very time-dependent. Similarly to the example in the documentation, the week number is the grouping index, and each week should appear in only one fold.
Suppose I need 10 folds. What I need is to shuffle the data first, before I can use GroupKFold.
The shuffling is meant in a group sense, i.e. whole groups should be shuffled among each other.
Is there an elegant way to do this with scikit-learn? It seems to me that GroupKFold is unaffected by shuffling the data first.
If there is no way to do it with scikit-learn, can anyone write some efficient code for this? I have large data sets.
The feature matrix, labels and groups would be the inputs.
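If nothing built-in fits, a small helper that shuffles whole groups and then deals them out to folds does the job. The sketch below is not a scikit-learn API; the function name group_shuffled_kfold and its signature are my own, chosen to match the matrix/label/groups inputs asked for above:

import numpy as np

def group_shuffled_kfold(X, y, groups, n_splits=10, seed=0):
    # y is accepted only to match the requested signature; it is not used here
    rng = np.random.RandomState(seed)
    shuffled = rng.permutation(np.unique(groups))           # shuffle the distinct group labels
    for fold_groups in np.array_split(shuffled, n_splits):  # deal whole groups out to folds
        test_mask = np.isin(groups, fold_groups)            # every sample of a group stays together
        yield np.where(~test_mask)[0], np.where(test_mask)[0]

The yielded (train, test) index pairs can be passed anywhere scikit-learn accepts an iterable of splits as cv. Unlike GroupKFold, which tries to balance folds by number of samples, this sketch balances them only by number of groups.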
By default no shuffling occurs, including for the (stratified) k-fold cross-validation performed by specifying cv=some_integer to cross_val_score, grid search, etc. Keep in mind that train_test_split still returns a random split.
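For instance (the iris data and LogisticRegression below are only placeholders for illustration):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000)

# cv=5 uses (Stratified)KFold with shuffle=False, so folds follow the row order
print(cross_val_score(clf, X, y, cv=5))

# to get shuffling, pass an explicit splitter instead of an integer
print(cross_val_score(clf, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0)))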
Shuffle-Group(s)-Out cross-validation iterator. Provides randomized train/test indices to split data according to a third-party provided group. This group information can be used to encode arbitrary domain specific stratifications of the samples as integers.
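This describes the GroupShuffleSplit class; a minimal sketch of its use (the tiny arrays are placeholders):

import numpy as np
from sklearn.model_selection import GroupShuffleSplit

X = np.arange(16).reshape(8, 2)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1])
groups = np.array([1, 1, 2, 2, 3, 3, 4, 4])   # e.g. week numbers

# each randomized split keeps a whole group entirely in train or entirely in test
gss = GroupShuffleSplit(n_splits=4, test_size=0.25, random_state=42)
for train_idx, test_idx in gss.split(X, y, groups):
    print("test groups:", np.unique(groups[test_idx]))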
KFold cross-validation with shuffle: in plain k-fold cross-validation the dataset is divided into k folds in order. When shuffle and random_state are set on KFold, the samples are assigned to folds randomly:
kfs = KFold(n_splits=5, shuffle=True, random_state=2021)
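Continuing from that kfs object, the shuffled folds can be iterated directly (X below is just a placeholder feature matrix):

import numpy as np
from sklearn.model_selection import KFold

kfs = KFold(n_splits=5, shuffle=True, random_state=2021)
X = np.arange(20).reshape(10, 2)          # placeholder feature matrix
for train_idx, test_idx in kfs.split(X):  # identical folds on every run thanks to random_state
    print(train_idx, test_idx)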
Shuffle Split method: repeated random subsampling validation, also referred to as Monte Carlo cross-validation, splits the dataset randomly into training and validation sets. Unlike k-fold cross-validation, the dataset is not divided into a fixed set of folds; each split is drawn at random.
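In scikit-learn this corresponds to ShuffleSplit; a small sketch with placeholder data:

import numpy as np
from sklearn.model_selection import ShuffleSplit

X = np.arange(20).reshape(10, 2)   # placeholder data
ss = ShuffleSplit(n_splits=10, test_size=0.25, random_state=0)
for train_idx, test_idx in ss.split(X):
    print("test:", test_idx)       # test sets may overlap between splits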
The same group will not appear in two different folds (the number of distinct groups has to be at least equal to the number of folds)
In GroupKFold, the groups array has the same length as the data, i.e. one group label per sample.
For data in X, y and groups:
import numpy as np
from sklearn.model_selection import GroupKFold
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier
X = np.array([[1, 2, 1, 1], [3, 4, 7, 8], [5, 6, 1, 3], [7, 8, 4, 7]])
y = np.array([0, 2, 1, 2])
groups = np.array([2, 1, 0, 1])  # one group label per sample
# one fold per distinct group; use np.unique because groups is a plain ndarray (it has no .unique)
group_kfold = GroupKFold(n_splits=len(np.unique(groups)))
group_kfold.get_n_splits(X, y, groups)
param_grid ={
'min_child_weight': [50,100],
'subsample': [0.1,0.2],
'colsample_bytree': [0.1,0.2],
'max_depth': [2,3],
'learning_rate': [0.01],
'n_estimators': [100,500],
'reg_lambda': [0.1,0.2]
}
xgb = XGBClassifier()
# hand the group-aware splits to the grid search (note the lower-case y defined above)
grid_search = GridSearchCV(xgb, param_grid, cv=group_kfold.split(X, y, groups), n_jobs=-1)
result = grid_search.fit(X, y)
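Equivalently, you can pass cv=group_kfold (the splitter object itself) to GridSearchCV and supply groups=groups in the call to fit; the fitted search then exposes the usual result.best_params_ and result.best_score_.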