I am trying to split my dataframe (~188k rows) into train and test samples. The column 'FLAG' is my target variable and contains either 0 or 1.
Since only about 1300 rows have 'FLAG' equal to 1, I want to do a stratified split to ensure both samples contain a representative share of 1 values.
I tried to split using sklearn's train_test_split function:
train, test = train_test_split(df, test_size=0.2, stratify=df["FLAG"])
My problem is that the resulting train and test samples have 177942 and 52 rows, respectively. I would have expected something like 150400 and 37600 rows.
My understanding from reading the documentation (sklearn.model_selection.train_test_split) is that I have to provide my dataframe, the test_size and the column containing the target classes (i.e. 'FLAG' in my case).
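As a rough cross-check of the expected sizes, the arithmetic can be sketched in plain Python. The helper name `expected_split_sizes` is hypothetical, rounding the test share up is an assumption, and the inputs are the approximate figures from above (188000 rows, about 1300 of them with 'FLAG' equal to 1).

```python
import math
from fractions import Fraction


def expected_split_sizes(n_rows, test_size):
    """Return (n_train, n_test) for a fractional test_size.

    Hypothetical helper: rounds the test share up (an assumption) and
    gives the remainder to train. Fraction avoids float rounding noise.
    """
    share = Fraction(str(test_size))
    n_test = math.ceil(n_rows * share)
    return n_rows - n_test, n_test


# Approximate figures from the question: ~188k rows, ~1300 positives.
print(expected_split_sizes(188_000, 0.2))  # (150400, 37600)
print(expected_split_sizes(1_300, 0.2))    # (1040, 260)
```

With a stratified split, roughly 260 of the ~1300 positive rows should land in the test sample, so a test sample of 52 rows in total cannot be right.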
Even a generic example:
df = pd.DataFrame(data={'a': np.random.rand(100000), 'b': np.random.rand(100000), 'c': 0})
df.loc[np.random.randint(0, 100000, 1000), 'c'] = 1
tr, ts = train_test_split(df, test_size=.2, stratify=df['c'])
print(tr.shape, ts.shape)
Returns: (93105, 3) (38, 3)
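One quick way to see that something is wrong: the two parts do not add up to the input. A minimal sketch of that check, using the shapes printed above (the helper name `missing_rows` is made up for illustration):

```python
def missing_rows(n_total, n_train, n_test):
    """Rows of the input that ended up in neither train nor test."""
    return n_total - (n_train + n_test)


# Generic example: 100000 input rows, reported shapes (93105, 3) and (38, 3).
print(missing_rows(100_000, 93_105, 38))  # 6857
```

A correct split should give 0 here, since every input row belongs to exactly one of the two samples.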
My list of imports:
import cx_Oracle
import pandas as pd
from sklearn.model_selection import train_test_split
import numpy as np
Python version: 3.7.0, sklearn version: 0.20.3, pandas version: 0.23.4
My investigation showed that the issue is caused by an integer overflow. It occurs only on 32-bit Python 3.7.x; the 64-bit version works fine.
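Which computation overflows is not identified here, so the following is only an illustration of what signed 32-bit wraparound does to a value, written in pure Python. The helper `wrap_int32` and the example operands are made up and are not taken from sklearn.

```python
def wrap_int32(value):
    """Reduce an arbitrary Python int to a signed 32-bit value (two's complement)."""
    return (value + 2**31) % 2**32 - 2**31


print(wrap_int32(2**31 - 1))         # 2147483647, largest value that still fits
print(wrap_int32(2**31))             # -2147483648, one past the limit wraps around
print(wrap_int32(100_000 * 80_000))  # -589934592, made-up operands whose product does not fit
```

Python's own integers never wrap like this; the effect only appears where a fixed-width 32-bit integer type is used underneath.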
In the end I switched to 64-bit Python to resolve the issue (I previously had to use the 32-bit version due to an unrelated Oracle package dependency).
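To check which build is actually running, the standard library is enough. A small sketch (the function name `interpreter_bits` is hypothetical):

```python
import struct
import sys


def interpreter_bits():
    """Pointer size of the running interpreter in bits (32 or 64 on common builds)."""
    return struct.calcsize("P") * 8


print(interpreter_bits())
print(sys.maxsize > 2**32)  # True on a 64-bit build, False on a 32-bit one
```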