I am using sklearn for a multi-class classification task. I need to split all my data into train_set and test_set, and I want to randomly take the same number of samples from each class. Currently, I am using this function:
X_train, X_test, y_train, y_test = cross_validation.train_test_split(Data, Target, test_size=0.3, random_state=0)
but it gives an unbalanced dataset! Any suggestions?
Use train_test_split() to get training and test sets. Control the size of the subsets with the parameters train_size and test_size. Determine the randomness of your splits with the random_state parameter. Obtain stratified splits with the stratify parameter.
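As a minimal sketch of the size and randomness parameters, assuming scikit-learn 0.18 or later (where train_test_split lives in sklearn.model_selection) and a made-up toy array:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (illustrative only): 20 samples, 2 features, two classes.
X = np.arange(40).reshape(20, 2)
y = np.array([0] * 10 + [1] * 10)

# test_size=0.25 holds out a quarter of the samples (5 of 20 here).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Reusing the same random_state reproduces the same split.
X_train2, X_test2, y_train2, y_test2 = train_test_split(
    X, y, test_size=0.25, random_state=0
)

print(X_train.shape, X_test.shape)
print(np.array_equal(X_test, X_test2))
```

Leaving random_state unset gives a different shuffle on each call, which is usually not what you want when comparing models.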
Split the dataset. We can use train_test_split to first make the split on the original dataset. Then, to get the validation set, we can apply the same function again to the resulting train set. The test set size is given as the ratio of the original data we want to use as the test set.
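The two-step split described here could be sketched as follows. The helper name train_val_test_split and its val_size parameter are hypothetical (not part of sklearn); both sizes are taken as fractions of the original dataset, so the validation fraction is rescaled before the second call.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def train_val_test_split(X, y, test_size=0.2, val_size=0.2, random_state=0):
    """Hypothetical helper: split into train, validation and test sets.

    test_size and val_size are fractions of the ORIGINAL dataset.
    """
    # First split: carve the test set out of the full data.
    X_rest, X_test, y_rest, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    # Second split: the validation fraction is relative to what remains.
    relative_val = val_size / (1.0 - test_size)
    X_train, X_val, y_train, y_val = train_test_split(
        X_rest, y_rest, test_size=relative_val, random_state=random_state
    )
    return X_train, X_val, X_test, y_train, y_val, y_test


# Toy data (illustrative only): 100 samples, 2 features.
X = np.arange(200).reshape(100, 2)
y = np.arange(100) % 2

X_train, X_val, X_test, y_train, y_val, y_test = train_val_test_split(X, y)
print(len(X_train), len(X_val), len(X_test))
```

Passing stratify=y to the first call and stratify=y_rest to the second would additionally keep the class proportions in all three sets.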
The most reliable way to split the data into these three sets is to have one directory for train, one for dev and one for test. For instance, if you have a dataset of images, you could have a structure like this, with 80% in the training set, 10% in the dev set and 10% in the test set.
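One way to decide which file goes into which directory is to shuffle the file names once with a fixed seed and slice the list. This is only a sketch: the function name split_filenames, the img_*.jpg names and the seed are all made up for illustration, and actually moving the files into train/, dev/ and test/ is left out.

```python
import random


def split_filenames(filenames, train_frac=0.8, dev_frac=0.1, seed=42):
    """Hypothetical helper: assign file names to train/dev/test lists."""
    # Sort first so the result does not depend on directory listing order.
    files = sorted(filenames)
    random.Random(seed).shuffle(files)

    n_train = int(train_frac * len(files))
    n_dev = int(dev_frac * len(files))
    return {
        "train": files[:n_train],
        "dev": files[n_train:n_train + n_dev],
        "test": files[n_train + n_dev:],  # whatever is left over
    }


# Illustrative file names only.
filenames = [f"img_{i:03d}.jpg" for i in range(100)]
splits = split_filenames(filenames)
print({name: len(files) for name, files in splits.items()})
```

Fixing the seed means the same files end up in the same set every time, so the test set never leaks into training between runs.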
Although Christian's suggestion is correct, technically train_test_split should give you stratified results by using the stratify param.
So you could do:
X_train, X_test, y_train, y_test = cross_validation.train_test_split(Data, Target, test_size=0.3, random_state=0, stratify=Target)
The catch is that this parameter is only available from version 0.17 of sklearn.
From the documentation about the parameter stratify:
stratify : array-like or None (default is None) If not None, data is split in a stratified fashion, using this as the labels array. New in version 0.17: stratify splitting
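For readers on a newer release: sklearn.cross_validation was removed in scikit-learn 0.20, and train_test_split has been available from sklearn.model_selection since 0.18. A small check of what stratify does, using made-up imbalanced labels (80 vs. 20) and test_size=0.25 so the counts divide evenly:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced data (illustrative only): 80 samples of class 0, 20 of class 1.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 80 + [1] * 20)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Per-class counts in each subset; both keep the original 80/20 ratio.
train_counts = np.bincount(y_train)
test_counts = np.bincount(y_test)
print(train_counts, test_counts)
```

Note that stratification preserves the class proportions of the original data. It does not make the classes equal in size.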
You can use StratifiedShuffleSplit to create datasets featuring the same percentage of classes as the original one:
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

X = np.array([[1, 3], [3, 7], [2, 4], [4, 8]])
y = np.array([0, 1, 0, 1])

# In sklearn.model_selection the labels are passed to split(), not to the
# constructor, and n_iter has been renamed to n_splits.
stratSplit = StratifiedShuffleSplit(n_splits=1, test_size=0.5, random_state=42)

for train_idx, test_idx in stratSplit.split(X, y):
    X_train = X[train_idx]
    y_train = y[train_idx]

print(X_train)  # two rows, one from each class
print(y_train)  # one label 0 and one label 1
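Stratified splits keep the original class ratio, so an imbalanced dataset stays imbalanced. If the goal is literally the same number of samples from each class, as the question asks, one possible approach is to draw the indices per class by hand. The helper sample_equal_per_class below is hypothetical and uses plain NumPy; it is a sketch, not an sklearn feature.

```python
import numpy as np


def sample_equal_per_class(y, n_per_class, seed=0):
    """Hypothetical helper: pick n_per_class random indices for every label."""
    rng = np.random.default_rng(seed)
    picked = []
    for label in np.unique(y):
        candidates = np.flatnonzero(y == label)
        # replace=False raises ValueError if a class has too few samples.
        picked.append(rng.choice(candidates, size=n_per_class, replace=False))
    return np.concatenate(picked)


# Toy imbalanced labels (illustrative only).
y = np.array([0] * 80 + [1] * 20)

train_idx = sample_equal_per_class(y, n_per_class=15)
# Everything not picked for training becomes the test set.
test_idx = np.setdiff1d(np.arange(len(y)), train_idx)

print(np.bincount(y[train_idx]), len(test_idx))
```

The training set is balanced this way, but the leftover test set is not; the same helper could be applied again to the remaining indices if a balanced test set is also needed.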