I have a dataframe df of the form:
   cat_var_1 cat_var_2  num_var_1
0     Orange    Monkey         34
1     Banana       Cat         56
2     Orange       Dog         22
3     Banana    Monkey          6
..
Suppose the possible values of cat_var_1 in the dataset occur in the ratios {'Orange': 0.6, 'Banana': 0.4}, and the possible values of cat_var_2 occur in the ratios {'Monkey': 0.2, 'Cat': 0.7, 'Dog': 0.1}.
How do I split the data into train, test and validation sets (60:20:20 split) such that the ratios of the categorical variables are preserved? In practice, there can be any number of these variables, not just two. Also, clearly, the exact ratios may never be achieved in practice, but we would like to get as near as possible.
I have looked into the StratifiedKFold method from sklearn described here: how to split a dataset into training and validation set keeping ratio between classes? but it stratifies on only one categorical variable.
Additionally, I would be grateful if you could provide the complexity of the solution you achieve.
StratifiedKFold is a variation of KFold that returns stratified folds: the folds are made by preserving the percentage of samples for each class, and the object provides train/test indices to split the data into train/test sets.
In stratified k-fold cross-validation, the folds are selected so that the class distribution is approximately equal in all of them. In the case of a dichotomous classification, this means that each fold contains roughly the same proportions of the two class labels.
You need to know what "KFold" and "Stratified" mean first. KFold is a cross-validator that divides the dataset into k folds. "Stratified" means that each fold has the same proportion of observations with a given label as the full dataset.
Stratified k-fold cross-validation is an extension of plain cross-validation for classification problems: it maintains the same class ratio in each of the K folds as in the original dataset.
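As a minimal illustration of single-variable stratification (the toy labels below are made up for the example):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels with a 60:40 class ratio (12 zeros, 8 ones)
X = np.arange(20).reshape(-1, 1)
y = np.array([0] * 12 + [1] * 8)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(X, y):
    # Each 4-sample test fold keeps roughly the 60:40 ratio
    print(np.bincount(y[test_idx], minlength=2))
```

Each printed pair is close to (2–3 zeros, 1–2 ones), i.e. the 60:40 ratio carried into every fold.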
You can pass df.cat_var_1 + "_" + df.cat_var_2 as the y argument of StratifiedShuffleSplit.split():
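A sketch of that idea, using the column names from the question (the two-stage split, first carving off 40% and then halving it, is one way to turn two binary splits into a 60:20:20 three-way split):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Toy data with the ratios from the question (rows are random)
rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "cat_var_1": rng.choice(["Orange", "Banana"], size=n, p=[0.6, 0.4]),
    "cat_var_2": rng.choice(["Monkey", "Cat", "Dog"], size=n, p=[0.2, 0.7, 0.1]),
    "num_var_1": rng.integers(0, 100, size=n),
})

# One combined key stratifies on both variables at once
y = df.cat_var_1 + "_" + df.cat_var_2

# 60% train vs. 40% rest, then split the rest in half (20/20)
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.4, random_state=0)
idx_train, idx_rest = next(sss.split(df, y))
sss2 = StratifiedShuffleSplit(n_splits=1, test_size=0.5, random_state=0)
rel_test, rel_val = next(sss2.split(df.iloc[idx_rest], y.iloc[idx_rest]))
idx_test, idx_val = idx_rest[rel_test], idx_rest[rel_val]
```

Note that every combination of categories must have at least a couple of rows, or the stratifier will raise an error for the too-rare combined class.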
But here is an alternative that uses DataFrame.groupby:
import random

import numpy as np
import pandas as pd

# Build a synthetic dataset with the ratios from the question
nrows = 10000
p1 = {'Orange': 0.6, 'Banana': 0.4}
p2 = {'Monkey': 0.2, 'Cat': 0.7, 'Dog': 0.1}
c1 = [key for key, val in p1.items() for _ in range(int(nrows * val))]
c2 = [key for key, val in p2.items() for _ in range(int(nrows * val))]
random.shuffle(c1)
random.shuffle(c2)
df = pd.DataFrame({"c1": c1, "c2": c2, "val": np.random.randint(0, 100, nrows)})

# Shuffle each (c1, c2) group and cut it 60:20:20, so every
# combination of categories is split in the same proportions
index = []
for key, idx in df.groupby(["c1", "c2"]).groups.items():
    arr = idx.values.copy()
    np.random.shuffle(arr)
    cut1 = int(0.6 * len(arr))   # end of the train slice
    cut2 = int(0.8 * len(arr))   # end of the test slice
    index.append(np.split(arr, [cut1, cut2]))

# Stitch the per-group pieces back into three global index arrays
idx_train, idx_test, idx_validate = list(map(np.concatenate, zip(*index)))
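To check that the split preserves the ratios, here is a self-contained sketch of the same groupby approach (using numpy's Generator for reproducibility) that prints the category proportions in each split:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 10000
df = pd.DataFrame({
    "c1": rng.choice(["Orange", "Banana"], size=n, p=[0.6, 0.4]),
    "c2": rng.choice(["Monkey", "Cat", "Dog"], size=n, p=[0.2, 0.7, 0.1]),
    "val": rng.integers(0, 100, size=n),
})

# 60:20:20 split of every (c1, c2) group, as in the answer above
parts = []
for _, idx in df.groupby(["c1", "c2"]).groups.items():
    arr = idx.values.copy()
    rng.shuffle(arr)
    parts.append(np.split(arr, [int(0.6 * len(arr)), int(0.8 * len(arr))]))
idx_train, idx_test, idx_validate = map(np.concatenate, zip(*parts))

# The category ratios in each split stay close to the originals
for name, idx in [("train", idx_train), ("test", idx_test), ("validate", idx_validate)]:
    print(name, df.loc[idx, "c1"].value_counts(normalize=True).round(3).to_dict())
```

On the complexity question: the approach is roughly linear in the number of rows. Grouping, shuffling, splitting and concatenating are all O(n); only the (small) set of group keys gets sorted by groupby.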