 

How to achieve stratified K fold splitting for an arbitrary number of categorical variables?

I have a dataframe df of the form:

    cat_var_1    cat_var_2     num_var_1
0    Orange       Monkey         34
1    Banana        Cat           56
2    Orange        Dog           22
3    Banana       Monkey          6
..

Suppose the possible values of cat_var_1 in the dataset occur in the ratios {'Orange': 0.6, 'Banana': 0.4} and the possible values of cat_var_2 occur in the ratios {'Monkey': 0.2, 'Cat': 0.7, 'Dog': 0.1}.

How do I split the data into train, test and validation sets (a 60:20:20 split) such that the ratios of the categorical variables are preserved? In practice there can be any number of these variables, not just two. Clearly, the exact ratios may never be achieved exactly, but we would like to get as close as possible.

I have looked into the StratifiedKFold method from sklearn described here: how to split a dataset into training and validation set keeping ratio between classes? but it stratifies on the basis of only one categorical variable.

Additionally, I would be grateful if you could state the complexity of the solution you propose.

Melsauce asked Feb 26 '18


People also ask

What is stratified K fold sampling?

The stratified K fold cross-validation object is a variation of KFold that returns stratified folds. The folds are made by preserving the percentage of samples for each class, and it provides train/test indices for splitting the data into train and test sets.
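
For illustration, a minimal sketch of the usual sklearn usage (the toy X and y below are made up):

import numpy as np
from sklearn.model_selection import StratifiedKFold

# 8 toy samples with a 75:25 label split
X = np.arange(16).reshape(8, 2)
y = np.array([0, 0, 0, 1, 0, 0, 0, 1])

skf = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(X, y):
    # each fold keeps roughly the same 75:25 label ratio as y
    print("train:", train_idx, "test:", test_idx)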

How does stratified k-fold cross-validation work?

In stratified k-fold cross-validation, the folds are selected so that the mean response value is approximately equal in all the folds. In the case of a dichotomous classification, this means that each fold contains roughly the same proportions of the two types of class labels.

What is the difference between KFold and stratified KFold?

You need to know what "KFold" and "stratified" mean first. KFold is a cross-validator that divides the dataset into k folds. Stratification ensures that each fold has the same proportion of observations with a given label.

Why do we use stratified K fold?

The stratified k fold cross-validation is an extension of the cross-validation technique used for classification problems. It maintains the same class ratio throughout the K folds as the ratio in the original dataset.
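
To make the difference concrete, here is a small sketch (the imbalanced labels are made up) that prints the fraction of the minority class in each test fold for both cross-validators; StratifiedKFold keeps it close to the overall 10%:

import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# made-up labels: 90 samples of class 0 and 10 of class 1
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((100, 1))

for name, cv in [("KFold", KFold(n_splits=5, shuffle=True, random_state=0)),
                 ("StratifiedKFold", StratifiedKFold(n_splits=5, shuffle=True, random_state=0))]:
    # fraction of class 1 in each test fold
    ratios = [y[test].mean() for _, test in cv.split(X, y)]
    print(name, np.round(ratios, 2))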


1 Answer

You can pass df.cat_var_1 + "_" + df.cat_var_2 as the y argument of StratifiedShuffleSplit.split(), so that every combination of the two categorical variables becomes its own stratum.
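
For example, a sketch of that idea: the two-stage split sizes and random_state below are arbitrary choices, and every (cat_var_1, cat_var_2) combination needs at least a few rows for the stratification to work.

from sklearn.model_selection import StratifiedShuffleSplit

# treat every (cat_var_1, cat_var_2) combination as one stratum
strata = df.cat_var_1 + "_" + df.cat_var_2

# first cut: 60% train, 40% held out
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.4, random_state=0)
train_idx, rest_idx = next(sss.split(df, strata))

# second cut: split the held-out 40% in half (20% test, 20% validation)
rest_strata = strata.iloc[rest_idx]
sss2 = StratifiedShuffleSplit(n_splits=1, test_size=0.5, random_state=0)
test_pos, val_pos = next(sss2.split(rest_idx.reshape(-1, 1), rest_strata))

df_train = df.iloc[train_idx]
df_test = df.iloc[rest_idx[test_pos]]
df_val = df.iloc[rest_idx[val_pos]]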

But here is a method that uses DataFrame.groupby:

import random

import pandas as pd
import numpy as np

# build a synthetic frame whose categorical columns follow the desired ratios
nrows = 10000
p1 = {'Orange': 0.6, 'Banana': 0.4}
p2 = {'Monkey': 0.2, 'Cat': 0.7, 'Dog': 0.1}

c1 = [key for key, val in p1.items() for i in range(int(nrows * val))]
c2 = [key for key, val in p2.items() for i in range(int(nrows * val))]
random.shuffle(c1)
random.shuffle(c2)

df = pd.DataFrame({"c1": c1, "c2": c2, "val": np.random.randint(0, 100, nrows)})

# split every (c1, c2) group 60:20:20 so each combination keeps its share
index = []
for key, idx in df.groupby(["c1", "c2"]).groups.items():
    arr = idx.values.copy()
    np.random.shuffle(arr)
    cut1 = int(0.6 * len(arr))   # end of the train slice
    cut2 = int(0.8 * len(arr))   # end of the test slice
    index.append(np.split(arr, [cut1, cut2]))

idx_train, idx_test, idx_validate = list(map(np.concatenate, zip(*index)))
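
The three index arrays can then be passed to df.loc (for example df.loc[idx_train]) to materialise the sets. As for the complexity asked about in the question: the work is dominated by the groupby and the per-group shuffles, so the split is roughly linear in the number of rows, plus a small term for handling the (normally few) distinct category combinations.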
HYRY answered Oct 11 '22