As the title says, I am wondering what the difference is between
StratifiedKFold with the parameter shuffle=True
StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
and
StratifiedShuffleSplit
StratifiedShuffleSplit(n_splits=10, test_size='default', train_size=None, random_state=0)
and what is the advantage of using StratifiedShuffleSplit?
You need to know what "KFold" and "stratified" mean first. KFold is a cross-validator that divides the dataset into k folds. Stratification ensures that each fold of the dataset has the same proportion of observations with a given label.
From the scikit-learn documentation for StratifiedShuffleSplit: it provides train/test indices to split data into train/test sets. This cross-validation object is a merge of StratifiedKFold and ShuffleSplit, which returns stratified randomized folds. The folds are made by preserving the percentage of samples for each class.
Therefore, in classification tasks with imbalanced class distributions, we should prefer StratifiedKFold over KFold. Suppose, for example, that the ratio of class 0 to class 1 is 1:3. If we set k=4, then each stratified test set includes three data points from class 1 and one data point from class 0.
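To make that concrete, here is a minimal sketch (the 16-sample toy arrays X and y below are made up for illustration):

import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy data (an assumption for illustration): 4 samples of class 0 and
# 12 of class 1, so class 0 : class 1 = 1 : 3
y = np.array([0] * 4 + [1] * 12)
X = np.zeros((len(y), 1))  # features don't affect the split indices

skf = StratifiedKFold(n_splits=4)
for _, test_idx in skf.split(X, y):
    # Each test fold holds one class-0 point and three class-1 points
    print("test labels:", y[test_idx])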
For StratifiedShuffleSplit, 'n_splits' specifies the number of times the data is sampled from each stratum, in the proportion specified by 'test_size'. For example, consider a dataset containing 4 strata, with 3 records in each stratum.
What is StratifiedShuffleSplit? StratifiedShuffleSplit is a combination of ShuffleSplit and StratifiedKFold. With StratifiedShuffleSplit, the proportions of the class labels are almost the same in the train and test datasets.
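A small sketch of that behaviour, assuming a toy dataset with 4 strata of 3 records each (the arrays below are invented for illustration):

import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Toy data (an assumption): 4 strata (labels 0-3) with 3 records each
y = np.repeat([0, 1, 2, 3], 3)
X = np.arange(len(y)).reshape(-1, 1)

# Each of the 3 iterations draws a fresh stratified random test set;
# with test_size=4 and 4 strata, one record is sampled per stratum
sss = StratifiedShuffleSplit(n_splits=3, test_size=4, random_state=0)
for train_idx, test_idx in sss.split(X, y):
    print("TEST:", test_idx, "labels:", y[test_idx])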
StratifiedKFold is a variation of KFold. First, StratifiedKFold shuffles your data (if shuffle=True), then splits the data into n_splits parts, and that's it. It then uses each part in turn as a test set. Note that it shuffles the data only once, before splitting; with shuffle=True, the shuffling is controlled by your random_state.
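A quick way to see the "shuffled once, then partitioned" behaviour (the toy data below is an assumption):

import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy data (an assumption): 6 samples per class
y = np.array([0] * 6 + [1] * 6)
X = np.arange(12).reshape(-1, 1)

skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

# Because the data is shuffled once and then partitioned, the 3 test
# folds are disjoint and together cover every index exactly once
all_test = np.concatenate([test for _, test in skf.split(X, y)])
print(sorted(all_test))  # 0..11, each index appearing exactly once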
From the documentation for StratifiedKFold: Stratified K-Folds cross-validator. Provides train/test indices to split data in train/test sets. This cross-validation object is a variation of KFold that returns stratified folds. The folds are made by preserving the percentage of samples for each class.
In KFold, the test sets do not overlap, even with shuffle=True. With KFold and shuffle, the data is shuffled once at the start, and then divided into the number of desired splits. The test data is always one of the splits; the train data is the rest.
In ShuffleSplit, the data is shuffled every time, and then split. This means the test sets may overlap between the splits.
See this block for an example of the difference. Note the overlap of the elements in the test sets for ShuffleSplit.
splits = 5

tx = range(10)
ty = [0] * 5 + [1] * 5

from sklearn.model_selection import StratifiedShuffleSplit, StratifiedKFold

kfold = StratifiedKFold(n_splits=splits, shuffle=True, random_state=42)
shufflesplit = StratifiedShuffleSplit(n_splits=splits, random_state=42, test_size=2)

print("KFold")
for train_index, test_index in kfold.split(tx, ty):
    print("TRAIN:", train_index, "TEST:", test_index)

print("Shuffle Split")
for train_index, test_index in shufflesplit.split(tx, ty):
    print("TRAIN:", train_index, "TEST:", test_index)
Output:
KFold
TRAIN: [0 2 3 4 5 6 7 9] TEST: [1 8]
TRAIN: [0 1 2 3 5 7 8 9] TEST: [4 6]
TRAIN: [0 1 3 4 5 6 8 9] TEST: [2 7]
TRAIN: [1 2 3 4 6 7 8 9] TEST: [0 5]
TRAIN: [0 1 2 4 5 6 7 8] TEST: [3 9]
Shuffle Split
TRAIN: [8 4 1 0 6 5 7 2] TEST: [3 9]
TRAIN: [7 0 3 9 4 5 1 6] TEST: [8 2]
TRAIN: [1 2 5 6 4 8 9 0] TEST: [3 7]
TRAIN: [4 6 7 8 3 5 1 2] TEST: [9 0]
TRAIN: [7 2 6 5 4 3 0 9] TEST: [1 8]
As for when to use them, I tend to use KFold for any cross-validation, and I use ShuffleSplit with a split of 2 for my train/test set splits. But I'm sure there are other use cases for both.
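For instance, a single stratified train/test split can be drawn like this (a minimal sketch; the toy X and y are made up):

import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Toy data (an assumption): 5 samples per class
y = np.array([0] * 5 + [1] * 5)
X = np.arange(10).reshape(-1, 1)

# n_splits=1 yields exactly one stratified, shuffled train/test split
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(sss.split(X, y))
print("TRAIN:", train_idx, "TEST:", test_idx)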