difference between StratifiedKFold and StratifiedShuffleSplit in sklearn

As the title says, I am wondering what the difference is between

StratifiedKFold with the parameter shuffle = True

StratifiedKFold(n_splits=10, shuffle=True, random_state=0) 

and

StratifiedShuffleSplit

StratifiedShuffleSplit(n_splits=10, test_size='default', train_size=None, random_state=0)

and what is the advantage of using StratifiedShuffleSplit?

asked Aug 30 '17 by gabboshow

People also ask

What is the difference between k fold and StratifiedKFold?

You need to know what "KFold" and "Stratified" mean first. KFold is a cross-validator that divides the dataset into k folds. "Stratified" means that each fold preserves the proportion of observations with a given label.
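The difference can be seen in a minimal sketch (hypothetical toy data: 8 samples of class 0 and 4 of class 1). Plain KFold cuts the data into consecutive folds regardless of labels, while StratifiedKFold gives every test fold the same 2:1 class mix:

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical toy data with a 2:1 class imbalance.
X = np.arange(12).reshape(-1, 1)
y = np.array([0] * 8 + [1] * 4)

for name, cv in [("KFold", KFold(n_splits=4)),
                 ("StratifiedKFold", StratifiedKFold(n_splits=4))]:
    print(name)
    for _, test_idx in cv.split(X, y):
        # StratifiedKFold: every fold has two 0s and one 1.
        # KFold (no shuffle): early folds are all 0s, the last is all 1s.
        print("  test labels:", y[test_idx])
```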

What does StratifiedShuffleSplit do?

Provides train/test indices to split data in train/test sets. This cross-validation object is a merge of StratifiedKFold and ShuffleSplit, which returns stratified randomized folds. The folds are made by preserving the percentage of samples for each class.

Is stratified KFold better than KFold?

In classification tasks with imbalanced class distributions, we should therefore prefer StratifiedKFold over KFold. Suppose the ratio of class 0 to class 1 is 1/3. If we set k=4, then each test set includes three data points from class 1 and one data point from class 0.
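A minimal sketch reproducing those counts (the 16-sample dataset is hypothetical: 4 samples of class 0, 12 of class 1, so the 0:1 ratio is 1/3):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical data matching the 1/3 ratio described above.
y = np.array([0] * 4 + [1] * 12)
X = np.zeros((16, 1))

for _, test_idx in StratifiedKFold(n_splits=4).split(X, y):
    # Each test fold: one sample of class 0, three of class 1.
    print("test fold labels:", y[test_idx])
```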

What is n_splits in StratifiedShuffleSplit?

For StratifiedShuffleSplit, 'n_splits' specifies the number of times the data is resampled, drawing from each stratum in the proportion given by 'test_size'. For example, consider a dataset containing 4 strata with 3 records each.
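That example can be sketched as follows (hypothetical data: class labels 0–3 stand in for the 4 strata, 3 records each; with test_size=1/3, each of the n_splits draws takes one record per stratum):

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Hypothetical dataset: 4 strata (labels 0-3), 3 records per stratum.
y = np.repeat([0, 1, 2, 3], 3)
X = np.zeros((12, 1))

# n_splits=5: sample the data 5 times; each draw takes test_size (1/3)
# of every stratum, i.e. one record per stratum here.
sss = StratifiedShuffleSplit(n_splits=5, test_size=1/3, random_state=0)
for _, test_idx in sss.split(X, y):
    print(sorted(y[test_idx].tolist()))  # one record per stratum each time
```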

What is StratifiedShuffleSplit?

StratifiedShuffleSplit is a combination of ShuffleSplit and StratifiedKFold. With StratifiedShuffleSplit, the distribution of class labels is almost the same in the train and test datasets.
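A minimal sketch of that behaviour (hypothetical data: 40 samples with a 75/25 class mix; since 20% of 40 splits each class evenly, both halves keep exactly a 25% share of class 1 here):

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Hypothetical toy data: 30 samples of class 0, 10 of class 1 (75/25 mix).
y = np.array([0] * 30 + [1] * 10)
X = np.zeros((40, 1))

sss = StratifiedShuffleSplit(n_splits=3, test_size=0.2, random_state=0)
for train_idx, test_idx in sss.split(X, y):
    # Both train and test keep the 25% share of class 1.
    print("train class-1 share:", float(y[train_idx].mean()),
          "test class-1 share:", float(y[test_idx].mean()))
```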

What is the difference between KFold and StratifiedKFold?

StratifiedKFold is a variation of KFold. First, StratifiedKFold shuffles your data, then splits it into n_splits parts, and uses each part in turn as a test set. Note that it shuffles the data only once, before splitting. With shuffle=True, the shuffling is controlled by your random_state.


What is stratified k-fold cross-validator?

Stratified K-Folds cross-validator. Provides train/test indices to split data in train/test sets. This cross-validation object is a variation of KFold that returns stratified folds. The folds are made by preserving the percentage of samples for each class.


1 Answer

In KFold, the test sets do not overlap, even with shuffle. With KFold and shuffle, the data is shuffled once at the start and then divided into the number of desired splits. The test data is always one of the splits; the train data is the rest.

In ShuffleSplit, the data is shuffled every time, and then split. This means the test sets may overlap between the splits.

See this block for an example of the difference. Note the overlap of the elements in the test sets for ShuffleSplit.

splits = 5

tx = range(10)
ty = [0] * 5 + [1] * 5

from sklearn.model_selection import StratifiedShuffleSplit, StratifiedKFold

kfold = StratifiedKFold(n_splits=splits, shuffle=True, random_state=42)
shufflesplit = StratifiedShuffleSplit(n_splits=splits, random_state=42, test_size=2)

print("KFold")
for train_index, test_index in kfold.split(tx, ty):
    print("TRAIN:", train_index, "TEST:", test_index)

print("Shuffle Split")
for train_index, test_index in shufflesplit.split(tx, ty):
    print("TRAIN:", train_index, "TEST:", test_index)

Output:

KFold
TRAIN: [0 2 3 4 5 6 7 9] TEST: [1 8]
TRAIN: [0 1 2 3 5 7 8 9] TEST: [4 6]
TRAIN: [0 1 3 4 5 6 8 9] TEST: [2 7]
TRAIN: [1 2 3 4 6 7 8 9] TEST: [0 5]
TRAIN: [0 1 2 4 5 6 7 8] TEST: [3 9]
Shuffle Split
TRAIN: [8 4 1 0 6 5 7 2] TEST: [3 9]
TRAIN: [7 0 3 9 4 5 1 6] TEST: [8 2]
TRAIN: [1 2 5 6 4 8 9 0] TEST: [3 7]
TRAIN: [4 6 7 8 3 5 1 2] TEST: [9 0]
TRAIN: [7 2 6 5 4 3 0 9] TEST: [1 8]

As for when to use them, I tend to use KFold for any cross-validation, and I use ShuffleSplit with a split of 2 for my train/test set splits. But I'm sure there are other use cases for both.

answered Sep 21 '22 by Ken Syme