sklearn train_test_split on pandas stratify by multiple columns

Tags:

python

pandas

scikit-learn

I'm a relatively new user to sklearn and have run into some unexpected behavior in train_test_split from sklearn.model_selection. I have a pandas dataframe that I would like to split into a training and test set. I would like to stratify my data by at least 2, but ideally 4 columns in my dataframe.

There were no warnings from sklearn when I tried to do this; however, I found later that there were repeated rows in my final data set. I created a sample test to show this behavior:

    import numpy as np
    import pandas as pd
    from sklearn.model_selection import train_test_split

    a = np.array([i for i in range(1000000)])
    b = [i%10 for i in a]
    c = [i%5 for i in a]
    df = pd.DataFrame({'a':a, 'b':b, 'c':c})

It seems to work as expected if I stratify by either column:

    train, test = train_test_split(df, test_size=0.2, random_state=0, stratify=df[['b']])
    print(len(train.a.values))  # prints 800000
    print(len(set(train.a.values)))  # prints 800000

    train, test = train_test_split(df, test_size=0.2, random_state=0, stratify=df[['c']])
    print(len(train.a.values))  # prints 800000
    print(len(set(train.a.values)))  # prints 800000

But when I try to stratify by both columns, I get repeated values:

    train, test = train_test_split(df, test_size=0.2, random_state=0, stratify=df[['b', 'c']])
    print(len(train.a.values))  # prints 800000
    print(len(set(train.a.values)))  # prints 640000

asked Aug 04 '17 22:08

Caitlin

2 Answers

If you want train_test_split to behave as you expected (stratify by multiple columns with no duplicates), create a new column that is a concatenation of the values in your other columns and stratify on the new column.

    df['bc'] = df['b'].astype(str) + df['c'].astype(str)
    train, test = train_test_split(df, test_size=0.2, random_state=0, stratify=df[['bc']])

If you're worried about collisions, where pairs such as 11 and 3 or 1 and 13 both concatenate to 113, put an arbitrary separator string in the middle:

    df['bc'] = df['b'].astype(str) + "_" + df['c'].astype(str)
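As a variation on the same idea, here is a minimal sketch that builds the combined key with pandas' `groupby(...).ngroup()` instead of string concatenation, then checks that the split contains no repeated rows. The smaller frame (1,000 rows) and the names `strata`, `n_train_unique`, and `overlap` are my own choices for illustration, not part of the original question.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Scaled-down version of the frame from the question.
a = np.arange(1000)
df = pd.DataFrame({"a": a, "b": a % 10, "c": a % 5})

# One integer label per distinct (b, c) pair. Because the labels are
# assigned per group, they cannot collide the way "11" + "3" and
# "1" + "13" do under plain string concatenation.
strata = df.groupby(["b", "c"]).ngroup()

train, test = train_test_split(
    df, test_size=0.2, random_state=0, stratify=strata
)

# Every row should appear exactly once, in either train or test.
n_train_unique = train["a"].nunique()
overlap = set(train["a"]) & set(test["a"])
print(len(train), n_train_unique, len(overlap))  # 800 800 0
```

Passing a single 1-D key keeps the behavior independent of how a given scikit-learn version treats a two-column `stratify` argument.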

answered Sep 20 '22 18:09

Sesquipedalism

You're getting duplicates because train_test_split() eventually defines strata as the unique set of values of whatever you passed into the stratify argument. Since the strata are built from the values of two columns, one row of data may belong to more than one stratum, so sampling may choose the same row twice because it thinks it's sampling from different classes.

The train_test_split() function calls StratifiedShuffleSplit, which uses np.unique() on y (which is what you pass in via stratify). From the source code:

    classes, y_indices = np.unique(y, return_inverse=True)
    n_classes = classes.shape[0]
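To see why that call matters for a two-column input, here is a small NumPy-only sketch. The array `y` is a made-up stand-in for `df[['b', 'c']]`, and the sketch only demonstrates `np.unique` itself; a given scikit-learn release may preprocess a 2-D `y` before this point, so treat it as an illustration of the mechanism rather than a trace of the library.

```python
import numpy as np

# Two rows and two columns, shaped like stratify=df[['b', 'c']].
y = np.array([["foo", "y"],
              ["bar", "z"]])

# Without axis=, np.unique flattens the input before finding classes.
classes, y_indices = np.unique(y, return_inverse=True)

print(classes)         # ['bar' 'foo' 'y' 'z']
print(y_indices.size)  # 4: one label per cell, not one per row

# With axis=0, each distinct row counts as a single class instead.
row_classes = np.unique(y, axis=0)
print(len(row_classes))  # 2
```

Two rows produce four labels, which is how a single row can end up counted under more than one class.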

Here's a simplified sample case, a variation on the example you provided:

    from sklearn.model_selection import train_test_split
    import numpy as np
    import pandas as pd

    N = 20
    a = np.arange(N)
    b = np.random.choice(["foo","bar"], size=N)
    c = np.random.choice(["y","z"], size=N)
    df = pd.DataFrame({'a':a, 'b':b, 'c':c})

    print(df)

         a    b  c
    0    0  bar  y
    1    1  foo  y
    2    2  bar  z
    3    3  bar  y
    4    4  foo  z
    5    5  bar  y
    ...

The stratification function thinks there are four classes to split on: foo, bar, y, and z. But since these classes are essentially nested, meaning y and z both show up in b == foo and b == bar, we'll get duplicates when the splitter tries to sample from each class.

    train, test = train_test_split(df, test_size=0.2, random_state=0,
                                   stratify=df[['b', 'c']])
    print(len(train.a.values))  # 16
    print(len(set(train.a.values)))  # 12

    print(train)

         a    b  c
    3    3  bar  y   # selecting a = 3 for b = bar*
    5    5  bar  y
    13  13  foo  y
    4    4  foo  z
    14  14  bar  z
    10  10  foo  z
    3    3  bar  y   # selecting a = 3 for c = y
    6    6  bar  y
    16  16  foo  y
    18  18  bar  z
    6    6  bar  y
    8    8  foo  y
    18  18  bar  z
    7    7  bar  z
    4    4  foo  z
    19  19  bar  y

    #* We can't be sure which row is selecting for `bar` or `y`,
    #  I'm just illustrating the idea here.

There's a larger design question here: Do you want to use nested stratified sampling, or do you actually just want to treat each class in df.b and df.c as a separate class to sample from? If the latter, that's what you're already getting. The former is more complicated, and that's not what train_test_split is set up to do.

You might find this discussion of nested stratified sampling useful.
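If the nested version is what you want, one possible sketch is to sample inside each (b, c) cell yourself with pandas and leave train_test_split out of it. This assumes pandas 1.1 or later, where `DataFrameGroupBy.sample` is available. The frame size, the seed, and the 20% fraction are arbitrary choices for illustration, and this is only one way to read "nested stratified sampling".

```python
import numpy as np
import pandas as pd

# Random example frame in the spirit of the one above (seeded here).
rng = np.random.default_rng(0)
N = 200
df = pd.DataFrame({
    "a": np.arange(N),
    "b": rng.choice(["foo", "bar"], size=N),
    "c": rng.choice(["y", "z"], size=N),
})

# Draw 20% of the rows from each (b, c) combination for the test set.
test = df.groupby(["b", "c"]).sample(frac=0.2, random_state=0)

# Everything that was not drawn becomes the training set.
train = df.drop(test.index)

print(len(train) + len(test) == N)         # True
print(train.index.intersection(test.index).empty)  # True
```

Because each row is drawn at most once and then dropped from the training frame, the two sets are disjoint by construction. Per-cell test sizes are rounded, so the overall test fraction is only approximately 20%.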

answered Sep 19 '22 18:09

andrew_reece
