
Stratified Train/Validation/Test-split in scikit-learn

There is already a description of how to do a stratified train/test split in scikit-learn via train_test_split (Stratified Train/Test-split in scikit-learn) and of how to do a random train/validation/test split via np.split (How to split data into 3 sets (train, validation and test)?). But what about a stratified train/validation/test split?

The closest approximation that comes to mind for doing stratified (on class label) train/validation/test split is as follows, but I suspect there's a better way that can perhaps achieve this in one function call or in a more accurate way:

Let's say we want a 60/20/20 train/validation/test split. My current approach is to first do a 60/40 stratified split, then a 50/50 stratified split on the held-out 40%, which ultimately gives a 60/20/20 stratified split.

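from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20

SEED = 2000
# stratify= is what makes each split preserve the class proportions of the labels
x_train, x_validation_and_test, y_train, y_validation_and_test = train_test_split(x, y, test_size=.4, random_state=SEED, stratify=y)
x_validation, x_test, y_validation, y_test = train_test_split(x_validation_and_test, y_validation_and_test, test_size=.5, random_state=SEED, stratify=y_validation_and_test)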

Please let me know whether my approach is correct and/or whether there is a better one.

Thank you

blu asked Nov 27 '16 12:11



2 Answers

The solution is to just use StratifiedShuffleSplit twice, like below:

from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.4, random_state=42)
for train_index, test_valid_index in split.split(df, df.target):
    train_set = df.iloc[train_index]
    test_valid_set = df.iloc[test_valid_index]

split2 = StratifiedShuffleSplit(n_splits=1, test_size=0.5, random_state=42)
for test_index, valid_index in split2.split(test_valid_set, test_valid_set.target):
    test_set = test_valid_set.iloc[test_index]
    valid_set = test_valid_set.iloc[valid_index]
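
As a quick sanity check on the approach above, the sketch below runs the same two StratifiedShuffleSplit passes on a small synthetic DataFrame and compares the class proportions of the three resulting sets. The DataFrame, its `feature` and `target` columns, and the 70/30 class balance are assumptions made up for illustration; `next(...)` is used in place of the one-iteration `for` loop.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Hypothetical data: 1000 rows, 70% class 0 and 30% class 1.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "feature": rng.normal(size=1000),
    "target": np.array([0] * 700 + [1] * 300),
})

# First pass: 60% train, 40% held out for validation + test.
split = StratifiedShuffleSplit(n_splits=1, test_size=0.4, random_state=42)
train_index, test_valid_index = next(split.split(df, df["target"]))
train_set = df.iloc[train_index]
test_valid_set = df.iloc[test_valid_index]

# Second pass: halve the held-out 40% into test and validation.
split2 = StratifiedShuffleSplit(n_splits=1, test_size=0.5, random_state=42)
test_index, valid_index = next(split2.split(test_valid_set, test_valid_set["target"]))
test_set = test_valid_set.iloc[test_index]
valid_set = test_valid_set.iloc[valid_index]

# Share of class 1 in each set; each should be close to the overall 0.30.
proportions = {
    name: subset["target"].mean()
    for name, subset in [("train", train_set), ("valid", valid_set), ("test", test_set)]
}
sizes = (len(train_set), len(valid_set), len(test_set))
print(sizes, proportions)
```

If the printed proportions drift noticeably from the overall class balance, the split was probably not stratified on the intended column.
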

Anton Dergunov answered Oct 07 '22 15:10


Yes, this is exactly how I would do it - running train_test_split() twice. Think of the first as splitting off your training set, and then that training set may get divided into different folds or holdouts down the line.
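
One detail worth spelling out: train_test_split only stratifies when the labels are passed through its `stratify` argument, and the second call's `test_size` is a fraction of the held-out portion rather than of the full dataset. A minimal sketch that wraps both calls is shown here; the helper name `stratified_three_way_split`, its parameters, and the synthetic `x` and `y` arrays are hypothetical and not part of scikit-learn or the question.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def stratified_three_way_split(x, y, val_fraction=0.2, test_fraction=0.2, seed=None):
    """Hypothetical helper: stratified train/validation/test split via two calls."""
    holdout_fraction = val_fraction + test_fraction
    # First call: carve off validation + test together, stratified on y.
    x_train, x_holdout, y_train, y_holdout = train_test_split(
        x, y, test_size=holdout_fraction, random_state=seed, stratify=y
    )
    # Second call: the test share is relative to the holdout, not the full data.
    x_val, x_test, y_val, y_test = train_test_split(
        x_holdout, y_holdout,
        test_size=test_fraction / holdout_fraction,
        random_state=seed, stratify=y_holdout,
    )
    return x_train, x_val, x_test, y_train, y_val, y_test


# Hypothetical data: 500 samples, 3 features, 80/20 class imbalance.
x = np.arange(1500).reshape(500, 3)
y = np.array([0] * 400 + [1] * 100)

x_train, x_val, x_test, y_train, y_val, y_test = stratified_three_way_split(x, y, seed=2000)
print(len(y_train), len(y_val), len(y_test))
print(y_train.mean(), y_val.mean(), y_test.mean())
```
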

In fact, if you end up using a scikit-learn estimator that includes built-in cross-validation, you may not even have to explicitly run train_test_split() again. The same applies if you use the (very handy!) model_selection.cross_val_score function.
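
To illustrate the cross-validation route: once a test set has been held out, cross_val_score with stratified folds can stand in for a fixed validation split. The sketch below assumes a synthetic dataset from make_classification and a LogisticRegression model, neither of which comes from the question.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

# Hypothetical data: 500 samples with an imbalanced binary label.
x, y = make_classification(
    n_samples=500, n_features=10, weights=[0.8, 0.2], random_state=0
)

# Hold out a stratified 20% test set once.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0, stratify=y
)

# Stratified 5-fold cross-validation on the training portion replaces
# a fixed validation set.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, x_train, y_train, cv=cv)

# Fit on all training data, then score once on the untouched test set.
test_accuracy = model.fit(x_train, y_train).score(x_test, y_test)
print(scores.mean(), test_accuracy)
```

The trade-off is that every fold serves as validation data in turn, so there is no single fixed validation set to inspect afterwards.
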

rocksteady answered Oct 07 '22 16:10