
How to set the test size in stratified k-fold sampling in Python?

Using sklearn, I want to have 3 splits (i.e. n_splits=3) of the sample dataset with a train/test ratio of 70:30. I'm able to split the set into 3 folds, but I can't define the test size (as I could with the train_test_split method). Is there a way to define the test sample size in StratifiedKFold?

from sklearn.model_selection import StratifiedKFold as SKF

skf = SKF(n_splits=3)
skf.get_n_splits(X, y)
for train_index, test_index in skf.split(X, y):
    # Loops over 3 iterations, each yielding a stratified train/test split
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
raul asked Aug 04 '17 07:08



1 Answer

StratifiedKFold performs, by definition, a K-fold split. That is, on each iteration the iterator it returns yields K-1 folds for training and 1 fold for testing. K is controlled by n_splits, so it creates folds of roughly n_samples/K samples each and cycles through all K choices of held-out fold. See Wikipedia or search for K-fold cross-validation for more details.

In short, the test set will contain 1/K of the data (i.e. 1/n_splits), so you can tune that parameter to control the test size (e.g. n_splits=3 gives a test split of 1/3 ≈ 33% of your data). However, StratifiedKFold will iterate over all K such train/test partitions, which might not be what you want.
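To make the 1/n_splits relationship concrete, here is a minimal sketch on made-up data (the toy X and y are assumptions for illustration, not the asker's dataset). It records the fraction of samples held out in each fold.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical toy data: 90 samples, two classes in a 2:1 ratio.
X = np.arange(180).reshape(90, 2)
y = np.array([0] * 60 + [1] * 30)

skf = StratifiedKFold(n_splits=3)
test_fractions = []
for train_index, test_index in skf.split(X, y):
    # Each iteration holds out one fold, i.e. about 1/n_splits of the samples.
    test_fractions.append(len(test_index) / len(y))
    # Class counts in the held-out fold mirror the overall class ratio.
    print(len(train_index), len(test_index), np.bincount(y[test_index]))

print(test_fractions)
```

With n_splits=3 every held-out fold covers about a third of the data, so a 70:30 ratio is not something StratifiedKFold can produce with 3 splits.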

That said, you might be interested in StratifiedShuffleSplit, which lets you configure both the number of splits and the train/test ratio. If you just want a single split, you can set n_splits=1 and keep test_size=0.3 (or whatever ratio you want).
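As a rough sketch of that suggestion applied to the question (3 splits at 70:30), something like the following should work. The toy X and y and the random_state value are assumptions added for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Hypothetical toy data: 100 samples, two classes in a 70:30 ratio.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 70 + [1] * 30)

# random_state is fixed here only to make the sketch reproducible.
sss = StratifiedShuffleSplit(n_splits=3, test_size=0.3, random_state=0)
split_sizes = []
for train_index, test_index in sss.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    split_sizes.append((len(train_index), len(test_index)))

print(split_sizes)
```

One difference to keep in mind: each shuffle split is drawn independently, so the test sets of different splits may overlap, whereas the K test folds of StratifiedKFold are disjoint.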

Imanol Luengo answered Jan 03 '23 16:01