I have a dataframe df of the form:
   cat_var_1 cat_var_2  num_var_1
0     Orange    Monkey         34
1     Banana       Cat         56
2     Orange       Dog         22
3     Banana    Monkey          6
..
Suppose the possible values of cat_var_1 in the dataset occur in the ratios {'Orange': 0.6, 'Banana': 0.4}, and the possible values of cat_var_2 occur in the ratios {'Monkey': 0.2, 'Cat': 0.7, 'Dog': 0.1}.
How do I split the data into train, test and validation sets (60:20:20 split) such that the ratios of the categorical variables are preserved? In practice, there can be any number of these variables, not just two. Also, clearly, the exact ratios may never be achieved in practice, but we would like to get as near as possible.
I have looked into the StratifiedKFold method from sklearn described here: how to split a dataset into training and validation set keeping ratio between classes? but it stratifies on only one categorical variable.
Additionally, I would be grateful if you could provide the complexity of the solution you achieve.
StratifiedKFold is a variation of KFold that returns stratified folds: the folds are made by preserving the percentage of samples for each class, and the object provides train/test indices to split the data into train/test sets.
In stratified k-fold cross-validation, the folds are selected so that the class distribution is approximately equal in all of them. In the case of a dichotomous classification, this means that each fold contains roughly the same proportions of the two class labels.
You need to know what "KFold" and "Stratified" mean first. KFold is a cross-validator that divides the dataset into k folds. "Stratified" means that each fold has the same proportion of observations with a given label as the full dataset.
Stratified k-fold cross-validation is an extension of plain cross-validation for classification problems: it maintains the same class ratio in each of the K folds as in the original dataset.
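As a minimal illustration of single-variable stratification (the toy labels below are made up for the example):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels with a 60:40 class ratio (12 zeros, 8 ones)
X = np.arange(20).reshape(-1, 1)
y = np.array([0] * 12 + [1] * 8)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(X, y):
    # Each 4-sample test fold keeps roughly the 60:40 ratio
    print(np.bincount(y[test_idx], minlength=2))
```

Each printed pair is close to (2–3 zeros, 1–2 ones), i.e. the 60:40 ratio carried into every fold.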
You can pass df.cat_var_1 + "_" + df.cat_var_2 as the y argument of StratifiedShuffleSplit.split():
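A sketch of that idea, using the column names from the question (the two-stage split, first carving off 40% and then halving it, is one way to turn two binary splits into a 60:20:20 three-way split):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Toy data with the ratios from the question (rows are random)
rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "cat_var_1": rng.choice(["Orange", "Banana"], size=n, p=[0.6, 0.4]),
    "cat_var_2": rng.choice(["Monkey", "Cat", "Dog"], size=n, p=[0.2, 0.7, 0.1]),
    "num_var_1": rng.integers(0, 100, size=n),
})

# One combined key stratifies on both variables at once
y = df.cat_var_1 + "_" + df.cat_var_2

# 60% train vs. 40% rest, then split the rest in half (20/20)
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.4, random_state=0)
idx_train, idx_rest = next(sss.split(df, y))
sss2 = StratifiedShuffleSplit(n_splits=1, test_size=0.5, random_state=0)
rel_test, rel_val = next(sss2.split(df.iloc[idx_rest], y.iloc[idx_rest]))
idx_test, idx_val = idx_rest[rel_test], idx_rest[rel_val]
```

Note that every combination of categories must have at least a couple of rows, or the stratifier will raise an error for the too-rare combined class.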
But here is an alternative that uses DataFrame.groupby:
import random

import numpy as np
import pandas as pd

# Build a synthetic dataset with the ratios from the question
nrows = 10000
p1 = {'Orange': 0.6, 'Banana': 0.4}
p2 = {'Monkey': 0.2, 'Cat': 0.7, 'Dog': 0.1}
c1 = [key for key, val in p1.items() for _ in range(int(nrows * val))]
c2 = [key for key, val in p2.items() for _ in range(int(nrows * val))]
random.shuffle(c1)
random.shuffle(c2)
df = pd.DataFrame({"c1": c1, "c2": c2, "val": np.random.randint(0, 100, nrows)})

# Shuffle each (c1, c2) group and cut it 60:20:20, so every
# combination of categories is split in the same proportions
index = []
for key, idx in df.groupby(["c1", "c2"]).groups.items():
    arr = idx.values.copy()
    np.random.shuffle(arr)
    cut1 = int(0.6 * len(arr))   # end of the train slice
    cut2 = int(0.8 * len(arr))   # end of the test slice
    index.append(np.split(arr, [cut1, cut2]))

# Stitch the per-group pieces back into three global index arrays
idx_train, idx_test, idx_validate = list(map(np.concatenate, zip(*index)))
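To check that the split preserves the ratios, here is a self-contained sketch of the same groupby approach (using numpy's Generator for reproducibility) that prints the category proportions in each split:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 10000
df = pd.DataFrame({
    "c1": rng.choice(["Orange", "Banana"], size=n, p=[0.6, 0.4]),
    "c2": rng.choice(["Monkey", "Cat", "Dog"], size=n, p=[0.2, 0.7, 0.1]),
    "val": rng.integers(0, 100, size=n),
})

# 60:20:20 split of every (c1, c2) group, as in the answer above
parts = []
for _, idx in df.groupby(["c1", "c2"]).groups.items():
    arr = idx.values.copy()
    rng.shuffle(arr)
    parts.append(np.split(arr, [int(0.6 * len(arr)), int(0.8 * len(arr))]))
idx_train, idx_test, idx_validate = map(np.concatenate, zip(*parts))

# The category ratios in each split stay close to the originals
for name, idx in [("train", idx_train), ("test", idx_test), ("validate", idx_validate)]:
    print(name, df.loc[idx, "c1"].value_counts(normalize=True).round(3).to_dict())
```

On the complexity question: the approach is roughly linear in the number of rows. Grouping, shuffling, splitting and concatenating are all O(n); only the (small) set of group keys gets sorted by groupby.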