I have a multi class classification problem and my dataset is skewed, I have 100 instances of a particular class and say 10 of some different class, so I want to split my dataset keeping ratio between classes, if I have 100 instances of a particular class and I want 30% of records to go in the training set I want to have there 30 instances of my 100 record represented class and 3 instances of my 10 record represented class and so on.
All Answers (39) Follow 70/30 rule. 70% for training and 30% for validation.
Generally, the training and validation data set is split into an 80:20 ratio.
We can use the train_test_split to first make the split on the original dataset. Then, to get the validation set, we can apply the same function to the train set to get the validation set. In the function below, the test set size is the ratio of the original data we want to use as the test set.
Empirical studies show that the best results are obtained if we use 20-30% of the data for testing, and the remaining 70-80% of the data for training.
You can use sklearn's StratifiedKFold
, from the online docs:
Stratified K-Folds cross validation iterator
Provides train/test indices to split data in train test sets.
This cross-validation object is a variation of KFold that returns stratified folds. The folds are made by preserving the percentage of samples for each class.
>>> from sklearn import cross_validation
>>> X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
>>> y = np.array([0, 0, 1, 1])
>>> skf = cross_validation.StratifiedKFold(y, n_folds=2)
>>> len(skf)
2
>>> print(skf)
sklearn.cross_validation.StratifiedKFold(labels=[0 0 1 1], n_folds=2,
shuffle=False, random_state=None)
>>> for train_index, test_index in skf:
... print("TRAIN:", train_index, "TEST:", test_index)
... X_train, X_test = X[train_index], X[test_index]
... y_train, y_test = y[train_index], y[test_index]
TRAIN: [1 3] TEST: [0 2]
TRAIN: [0 2] TEST: [1 3]
This will preserve your class ratios so that the splits retain the class ratios, this will work fine with pandas dfs.
As suggested by @Ali_m you could use StratifiedShuffledSplit
which accepts a split ratio param:
sss = StratifiedShuffleSplit(y, 3, test_size=0.7, random_state=0)
would produce a 70% split.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With