Test_train_split with stratify

Question

I am trying to split my dataframe (~188k rows) into train and test samples. The column 'FLAG' is my target variable and contains a value of either 0 or 1.

Since only about 1300 rows have 'FLAG' equal to 1, I want to do a stratified split to ensure there is a representative number of 1 values in both samples.

I tried to split using sklearn's train_test_split function:

train, test = train_test_split(df, test_size=0.2, stratify=df["FLAG"])

My problem is that the resulting train and test samples have 177942 and 52 rows, respectively. I would have expected something like 150400 and 37600 rows.
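For reference, the expected sizes can be worked out by hand. This is only a sketch: the helper name expected_split_sizes is made up for illustration, 188000 is a round stand-in for the "~188k" rows, and it assumes that a float test_size is turned into a row count by rounding up, with the remaining rows going to train.

```python
import math


def expected_split_sizes(n_rows, test_size):
    """Return (n_train, n_test) for a fractional test_size.

    Assumption: the test count is rounded up and the rest goes to train.
    """
    n_test = math.ceil(n_rows * test_size)
    return n_rows - n_test, n_test


# 188000 is an illustrative round number, not the exact row count.
n_train, n_test = expected_split_sizes(188000, 0.2)
print(n_train, n_test)

# Whatever the rounding rule, the two parts should add up to the input size.
print(n_train + n_test == 188000)
```

The second check is the useful one here: 177942 + 52 is far below 188k, so rows are going missing rather than merely being distributed differently.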

My understanding from reading the documentation (sklearn.model_selection.train_test_split) is that I have to provide my dataframe, the test_size and the column containing the target classes (i.e. 'FLAG' in my case).

Even a generic example shows the same problem:

df = pd.DataFrame(data={'a': np.random.rand(100000), 'b': np.random.rand(100000), 'c': 0})
df.loc[np.random.randint(0, 100000, 1000), 'c'] = 1
tr, ts = train_test_split(df, test_size=.2, stratify=df['c'])
print(tr.shape, ts.shape)

Returns: (93105, 3) (38, 3)

My list of imports:

import cx_Oracle
import pandas as pd
from sklearn.model_selection import train_test_split
import numpy as np

My Python version: 3.7.0, sklearn version: 0.20.3, pandas version: 0.23.4

tk78 · Accepted Answer

My investigation showed that the issue is caused by an integer overflow. It occurs only on 32-bit Python 3.7.x; the 64-bit version works fine.
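To illustrate what an overflow of this kind looks like, here is a small sketch using only the standard library. It does not reproduce the scikit-learn code path (I have not pinned down where exactly the overflow occurs), and wrap_int32 is a hypothetical helper; it only shows how a product of two row-count-sized numbers stops fitting into a signed 32-bit integer.

```python
import ctypes


def wrap_int32(value):
    """Reduce a Python int to what a signed 32-bit integer would hold."""
    # ctypes does no overflow checking, so out-of-range values wrap around.
    return ctypes.c_int32(value).value


# The largest value that still fits into a signed 32-bit integer.
print(wrap_int32(2**31 - 1))

# One past the limit wraps around to a negative number.
print(wrap_int32(2**31))

# Illustrative product of two numbers in the range of the row counts above;
# the true result needs more than 32 bits, so the wrapped value is wrong.
product = 150400 * 188000
print(product, wrap_int32(product))
```

Any intermediate result computed this way inside a 32-bit integer type would silently produce wrong counts, which is consistent with the odd split sizes above.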

In the end I switched to 64-bit Python to resolve the issue (I previously had to use the 32-bit version because of an unrelated Oracle package dependency).
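If you are not sure which build you are running, a quick check along these lines should tell you. It is a minimal sketch using only the standard library; interpreter_bits is a name made up for this example.

```python
import struct
import sys


def interpreter_bits():
    """Return the pointer width of the running interpreter in bits."""
    # "P" is the struct format code for a native pointer (void *).
    return struct.calcsize("P") * 8


print(interpreter_bits())

# Alternative check: sys.maxsize exceeds 2**32 only on 64-bit builds.
print(sys.maxsize > 2**32)
```

A result of 32 from the first line (or False from the second) means you are on the 32-bit build where the problem showed up.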

Tags: python, scikit-learn
