difference between StratifiedKFold and StratifiedShuffleSplit in sklearn

As the title says, I am wondering what the difference is between

StratifiedKFold with the parameter shuffle = True

StratifiedKFold(n_splits=10, shuffle=True, random_state=0) 

and

StratifiedShuffleSplit

StratifiedShuffleSplit(n_splits=10, test_size='default', train_size=None, random_state=0)

and what is the advantage of using StratifiedShuffleSplit?

asked Aug 30 '17 by gabboshow

People also ask

What is the difference between k fold and StratifiedKFold?

You need to know what "KFold" and "Stratified" mean first. KFold is a cross-validator that divides the dataset into k folds. "Stratified" means that each fold preserves the proportion of observations with a given label.
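The difference can be seen in a minimal sketch (hypothetical toy data: 8 samples of class 0 and 4 of class 1). Plain KFold cuts the data into consecutive folds regardless of labels, while StratifiedKFold gives every test fold the same 2:1 class mix:

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical toy data with a 2:1 class imbalance.
X = np.arange(12).reshape(-1, 1)
y = np.array([0] * 8 + [1] * 4)

for name, cv in [("KFold", KFold(n_splits=4)),
                 ("StratifiedKFold", StratifiedKFold(n_splits=4))]:
    print(name)
    for _, test_idx in cv.split(X, y):
        # StratifiedKFold: every fold has two 0s and one 1.
        # KFold (no shuffle): early folds are all 0s, the last is all 1s.
        print("  test labels:", y[test_idx])
```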

What does StratifiedShuffleSplit do?

Provides train/test indices to split data in train/test sets. This cross-validation object is a merge of StratifiedKFold and ShuffleSplit, which returns stratified randomized folds. The folds are made by preserving the percentage of samples for each class.

Is stratified KFold better than KFold?

In classification tasks with imbalanced class distributions, we should therefore prefer StratifiedKFold over KFold. Suppose the ratio of class 0 to class 1 is 1/3. If we set k=4, then each test set includes three data points from class 1 and one data point from class 0.
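A minimal sketch reproducing those counts (the 16-sample dataset is hypothetical: 4 samples of class 0, 12 of class 1, so the 0:1 ratio is 1/3):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical data matching the 1/3 ratio described above.
y = np.array([0] * 4 + [1] * 12)
X = np.zeros((16, 1))

for _, test_idx in StratifiedKFold(n_splits=4).split(X, y):
    # Each test fold: one sample of class 0, three of class 1.
    print("test fold labels:", y[test_idx])
```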

What is n_splits in StratifiedShuffleSplit?

For StratifiedShuffleSplit, 'n_splits' specifies the number of times the data is resampled, drawing from each stratum in the proportion given by 'test_size'. For example, consider a dataset containing 4 strata with 3 records each.
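That example can be sketched as follows (hypothetical data: class labels 0–3 stand in for the 4 strata, 3 records each; with test_size=1/3, each of the n_splits draws takes one record per stratum):

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Hypothetical dataset: 4 strata (labels 0-3), 3 records per stratum.
y = np.repeat([0, 1, 2, 3], 3)
X = np.zeros((12, 1))

# n_splits=5: sample the data 5 times; each draw takes test_size (1/3)
# of every stratum, i.e. one record per stratum here.
sss = StratifiedShuffleSplit(n_splits=5, test_size=1/3, random_state=0)
for _, test_idx in sss.split(X, y):
    print(sorted(y[test_idx].tolist()))  # one record per stratum each time
```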

What is StratifiedShuffleSplit?

StratifiedShuffleSplit is a combination of ShuffleSplit and StratifiedKFold. With StratifiedShuffleSplit, the distribution of class labels is almost the same in the train and test datasets.
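A minimal sketch of that behaviour (hypothetical data: 40 samples with a 75/25 class mix; since 20% of 40 splits each class evenly, both halves keep exactly a 25% share of class 1 here):

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Hypothetical toy data: 30 samples of class 0, 10 of class 1 (75/25 mix).
y = np.array([0] * 30 + [1] * 10)
X = np.zeros((40, 1))

sss = StratifiedShuffleSplit(n_splits=3, test_size=0.2, random_state=0)
for train_idx, test_idx in sss.split(X, y):
    # Both train and test keep the 25% share of class 1.
    print("train class-1 share:", float(y[train_idx].mean()),
          "test class-1 share:", float(y[test_idx].mean()))
```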

What is the difference between KFold and StratifiedKFold?

StratifiedKFold is a variation of KFold. First, StratifiedKFold shuffles your data, then splits it into n_splits parts, and uses each part in turn as a test set. Note that it shuffles the data only once, before splitting. With shuffle=True, the shuffling is controlled by your random_state.


What is stratified k-fold cross-validator?

Stratified K-Folds cross-validator. Provides train/test indices to split data in train/test sets. This cross-validation object is a variation of KFold that returns stratified folds. The folds are made by preserving the percentage of samples for each class.


1 Answer

In KFold, the test sets do not overlap, even with shuffle. With KFold and shuffle, the data is shuffled once at the start and then divided into the number of desired splits. The test data is always one of the splits; the train data is the rest.

In ShuffleSplit, the data is shuffled every time, and then split. This means the test sets may overlap between the splits.

See this block for an example of the difference. Note the overlap of the elements in the test sets for ShuffleSplit.

splits = 5

tx = range(10)
ty = [0] * 5 + [1] * 5

from sklearn.model_selection import StratifiedShuffleSplit, StratifiedKFold

kfold = StratifiedKFold(n_splits=splits, shuffle=True, random_state=42)
shufflesplit = StratifiedShuffleSplit(n_splits=splits, random_state=42, test_size=2)

print("KFold")
for train_index, test_index in kfold.split(tx, ty):
    print("TRAIN:", train_index, "TEST:", test_index)

print("Shuffle Split")
for train_index, test_index in shufflesplit.split(tx, ty):
    print("TRAIN:", train_index, "TEST:", test_index)

Output:

KFold
TRAIN: [0 2 3 4 5 6 7 9] TEST: [1 8]
TRAIN: [0 1 2 3 5 7 8 9] TEST: [4 6]
TRAIN: [0 1 3 4 5 6 8 9] TEST: [2 7]
TRAIN: [1 2 3 4 6 7 8 9] TEST: [0 5]
TRAIN: [0 1 2 4 5 6 7 8] TEST: [3 9]
Shuffle Split
TRAIN: [8 4 1 0 6 5 7 2] TEST: [3 9]
TRAIN: [7 0 3 9 4 5 1 6] TEST: [8 2]
TRAIN: [1 2 5 6 4 8 9 0] TEST: [3 7]
TRAIN: [4 6 7 8 3 5 1 2] TEST: [9 0]
TRAIN: [7 2 6 5 4 3 0 9] TEST: [1 8]

As for when to use them, I tend to use KFold for any cross-validation, and I use ShuffleSplit with a split of 2 for my train/test set splits. But I'm sure there are other use cases for both.

answered Sep 21 '22 by Ken Syme