Using sklearn, I want to have 3 splits (i.e. n_splits = 3) of the sample dataset with a train/test ratio of 70:30. I'm able to split the set into 3 folds, but I'm not able to define the test size (as with the train_test_split method). Is there a way to define the test sample size in StratifiedKFold?
from sklearn.model_selection import StratifiedKFold as SKF

skf = SKF(n_splits=3)
skf.get_n_splits(X, y)
for train_index, test_index in skf.split(X, y):
    # Loops over 3 iterations to get a stratified train/test split each time
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
StratifiedKFold performs, by definition, a K-fold split. That is, on each iteration the returned iterator yields K-1 folds for training and 1 fold for testing. K is controlled by n_splits, so it creates K groups of roughly n_samples/K samples each and iterates over all K combinations of K-1 training folds plus 1 test fold. Refer to Wikipedia or search for K-fold cross-validation for more information.
In short, the size of the test set will be 1/K (i.e. 1/n_splits), so you can tune that parameter to control the test size (e.g. n_splits=3 gives a test split of 1/3 ≈ 33% of your data). However, StratifiedKFold has no separate test-size setting and always iterates over all K train/test combinations, which might not be what you want.
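To make the 1/K relationship concrete, here is a minimal sketch. The toy arrays X and y (30 samples, two balanced classes) are assumptions made up for illustration and are not from the question.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy data (assumed for illustration): 30 samples, two balanced classes.
X = np.arange(60).reshape(30, 2)
y = np.array([0] * 15 + [1] * 15)

skf = StratifiedKFold(n_splits=3)

# Record the fraction of the data that lands in the test fold on each iteration.
test_fractions = []
for train_index, test_index in skf.split(X, y):
    test_fractions.append(len(test_index) / len(X))
    print(len(train_index), len(test_index))
```

With n_splits=3, every iteration holds out one third of the samples, so a 70:30 ratio cannot be reached by changing n_splits alone.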
Having said that, you might be interested in StratifiedShuffleSplit, which lets you configure both the number of splits and the train/test ratio. If you just want a single split, you can set n_splits=1 and still keep test_size=0.3 (or whatever ratio you want).
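As a rough sketch of what the question asks for (3 splits at 70:30), something along these lines should work. The toy arrays X and y and the random_state value are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Toy data (assumed for illustration): 100 samples, two balanced classes.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 50 + [1] * 50)

# 3 independent stratified splits, each holding out 30% of the data for testing.
# random_state is fixed here only to make the sketch reproducible.
sss = StratifiedShuffleSplit(n_splits=3, test_size=0.3, random_state=0)

split_sizes = []
for train_index, test_index in sss.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    split_sizes.append((len(train_index), len(test_index)))
    print(len(train_index), len(test_index))
```

Note that, unlike K-fold, these splits are drawn independently, so the test sets of different iterations may overlap and some samples may never appear in a test set.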