How to split data on balanced training set and test set on sklearn

I am using sklearn for a multi-class classification task. I need to split all the data into train_set and test_set, and I want to randomly take the same number of samples from each class. Currently, I am using this function:

X_train, X_test, y_train, y_test = cross_validation.train_test_split(Data, Target, test_size=0.3, random_state=0) 

but it gives an unbalanced dataset! Any suggestions?

Jeanne asked Feb 18 '16 04:02



2 Answers

Although Christian's suggestion is correct, train_test_split can itself give you stratified results if you pass the stratify parameter.

So you could do:

X_train, X_test, y_train, y_test = cross_validation.train_test_split(Data, Target, test_size=0.3, random_state=0, stratify=Target) 

The catch is that the stratify parameter is only available from sklearn version 0.17 onward.

From the documentation about the parameter stratify:

stratify : array-like or None (default is None)
    If not None, data is split in a stratified fashion, using this as the labels array.
    New in version 0.17: stratify splitting
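For readers on sklearn 0.18 or later, where train_test_split lives in sklearn.model_selection (the cross_validation module was removed in 0.20), the same call can be sketched as follows. The toy Data and Target arrays are assumptions made up for illustration; they only serve to show that both subsets keep the original class ratio.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 80 samples of class 0, 20 of class 1
Data = np.arange(200).reshape(100, 2)
Target = np.array([0] * 80 + [1] * 20)

# stratify=Target keeps the 80/20 class ratio in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    Data, Target, test_size=0.3, random_state=0, stratify=Target
)

# Count the samples of each class in each subset
train_counts = np.bincount(y_train)
test_counts = np.bincount(y_test)
print(train_counts, test_counts)
```

Note that stratification preserves the class proportions of the original data; it does not equalize the number of samples per class.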

Guiem Bosch answered Oct 21 '22 17:10


You can use StratifiedShuffleSplit to create datasets featuring the same percentage of classes as the original one:

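import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

X = np.array([[1, 3], [3, 7], [2, 4], [4, 8]])
y = np.array([0, 1, 0, 1])

# With sklearn.model_selection, the labels go to split(), not the constructor,
# and n_iter is called n_splits
stratSplit = StratifiedShuffleSplit(n_splits=1, test_size=0.5, random_state=42)
for train_idx, test_idx in stratSplit.split(X, y):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]

print(X_train)  # two rows, one from each class
print(y_train)  # one 0 and one 1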
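Stratified splitting keeps the class percentages of the original data, so an imbalanced dataset stays imbalanced. If literally the same number of samples per class is wanted, as the question asks, one possible approach is to sample indices per class with NumPy. This is only a sketch: the helper name balanced_indices, the toy data, and the choice of 10 samples per class are assumptions for illustration, not part of sklearn.

```python
import numpy as np

def balanced_indices(y, n_per_class, seed=0):
    """Return shuffled indices holding n_per_class samples of every class in y."""
    rng = np.random.default_rng(seed)
    picked = []
    for label in np.unique(y):
        candidates = np.flatnonzero(y == label)
        # Sample without replacement; raises ValueError if a class is too small
        picked.append(rng.choice(candidates, size=n_per_class, replace=False))
    idx = np.concatenate(picked)
    rng.shuffle(idx)
    return idx

# Hypothetical imbalanced data: 80 samples of class 0, 20 of class 1
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 80 + [1] * 20)

train_idx = balanced_indices(y, n_per_class=10)

# Everything not drawn for training goes to the test set
test_mask = np.ones(len(y), dtype=bool)
test_mask[train_idx] = False

X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_mask], y[test_mask]
```

Here only the training set is balanced; the test set holds the remaining samples and keeps whatever imbalance is left. Calling the helper a second time on the remaining indices would be one way to balance the test set too.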
Christian Hirsch answered Oct 21 '22 18:10