How to split data into balanced training and test sets in sklearn

Tags:

machine-learning

svm

scikit-learn

cross-validation

I am using sklearn for a multi-class classification task. I need to split all of my data into a train_set and a test_set, randomly taking the same number of samples from each class. Currently, I am using this function:

X_train, X_test, y_train, y_test = cross_validation.train_test_split(Data, Target, test_size=0.3, random_state=0)

but it gives an unbalanced dataset. Any suggestions?

257

asked Feb 18 '16 04:02

Jeanne

2 Answers

Although Christian's suggestion is correct, train_test_split itself can give you stratified results if you pass the stratify parameter.

So you could do:

X_train, X_test, y_train, y_test = cross_validation.train_test_split(Data, Target, test_size=0.3, random_state=0, stratify=Target)

The catch is that the stratify parameter is only available starting from sklearn version 0.17.

From the documentation about the parameter stratify:

stratify : array-like or None (default is None)
    If not None, data is split in a stratified fashion, using this as the labels array.
    New in version 0.17: stratify splitting
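As a rough illustration of what stratify does, here is a minimal sketch. It assumes a more recent sklearn, where train_test_split is imported from sklearn.model_selection rather than cross_validation, and it uses a made-up imbalanced dataset (80 samples of class 0, 20 of class 1) that is not from the question.

```python
from collections import Counter

import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 80 samples of class 0 and 20 of class 1.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 80 + [1] * 20)

# stratify=y asks for the 80/20 class ratio to be kept in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

train_counts = Counter(y_train.tolist())
test_counts = Counter(y_test.tolist())
print(train_counts)  # 56 samples of class 0, 14 of class 1
print(test_counts)   # 24 samples of class 0, 6 of class 1
```

Both subsets keep the original 4:1 ratio, so stratification preserves the class proportions; it does not make the classes equal in size.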

119

answered Oct 21 '22 17:10

Guiem Bosch

You can use StratifiedShuffleSplit to create datasets featuring the same percentage of classes as the original one:

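import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

X = np.array([[1, 3], [3, 7], [2, 4], [4, 8]])
y = np.array([0, 1, 0, 1])

# In sklearn.model_selection, the split settings go to the constructor
# and the data is passed to split().
stratSplit = StratifiedShuffleSplit(n_splits=1, test_size=0.5, random_state=42)
for train_idx, test_idx in stratSplit.split(X, y):
    X_train, y_train = X[train_idx], y[train_idx]
    X_test, y_test = X[test_idx], y[test_idx]

print(X_train)  # two rows, one from each class
print(y_train)  # one 0 and one 1 (the order depends on the shuffle)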
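Keep in mind that a stratified split reproduces the original class percentages, so an imbalanced dataset stays imbalanced. If the goal is literally the same number of training samples from each class, one possible approach is to sample indices per class by hand. The sketch below uses plain NumPy; balanced_split and its parameters are hypothetical names chosen for illustration, not part of sklearn.

```python
import numpy as np


def balanced_split(X, y, n_train_per_class, seed=0):
    """Hypothetical helper: draw the same number of random training
    samples from every class; everything else becomes the test set."""
    rng = np.random.default_rng(seed)
    train_idx = []
    for label in np.unique(y):
        label_idx = np.flatnonzero(y == label)
        if len(label_idx) < n_train_per_class:
            raise ValueError(f"class {label} has only {len(label_idx)} samples")
        # Sample without replacement so no row is picked twice.
        train_idx.extend(rng.choice(label_idx, size=n_train_per_class, replace=False))
    train_idx = np.array(train_idx)
    rng.shuffle(train_idx)  # avoid having the classes grouped together

    # Every index not chosen for training goes to the test set.
    test_mask = np.ones(len(y), dtype=bool)
    test_mask[train_idx] = False
    return X[train_idx], X[test_mask], y[train_idx], y[test_mask]


# Made-up imbalanced data: 80 samples of class 0 and 20 of class 1.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 80 + [1] * 20)

X_train, X_test, y_train, y_test = balanced_split(X, y, n_train_per_class=10)
print(np.bincount(y_train))  # 10 training samples per class
print(np.bincount(y_test))   # the remaining 70 and 10 samples
```

The training set is balanced here, but the test set is whatever remains, so it keeps an imbalance of its own.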

answered Oct 21 '22 18:10

Christian Hirsch
