For some reason, calling train_test_split triggers this error, even though the lengths are identical and the indexes look the same.
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold, train_test_split

data = {'col1':[30.5,45,1,99,6,5,4,2,5,7,7,3], 'col2':[99.5, 98, 95, 90,1,5,6,7,4,4,3,3],'col3':[23, 23.6, 3, 90,1,9,60,9,7,2,2,1]}
df = pd.DataFrame(data)
train, test = train_test_split(df, test_size=0.10)
X = train[['col1', 'col2']]
y2 = train['col3']
X = np.array(X)
kf = KFold(n_splits=3, shuffle=True)
for train_index, test_index in kf.split(X):
    X_train, y_train = X[train_index], y[train_index]  # this line triggers the error
y is a pandas Series (same length as X). X was a DataFrame with about 20 numerical columns, cast to a NumPy array.
For some reason train_test_split triggers the error despite the lengths being identical.
If I don't call train_test_split, it works fine.
The last line triggers the error, when y is indexed with a NumPy array of positions: y[train_index]
If that helps anyone, I had the same problem after using .groupby on a DataFrame. I fixed it with:
df.reset_index(drop=True, inplace=True)
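A minimal sketch of why resetting the index can help, assuming the error comes from integer labels that no longer run from 0 to n-1. The Series values and the `positions` array are made up for illustration; `positions` stands in for the arrays that `KFold.split` yields.

```python
import numpy as np
import pandas as pd

# Hypothetical Series whose labels are no longer 0..n-1,
# as can happen after a shuffle, a filter, or a groupby.
y = pd.Series([23.0, 3.0, 90.0, 9.0], index=[7, 2, 11, 5])

# Positional indices, standing in for what KFold.split would yield.
positions = np.array([0, 2])

# After the reset the labels are 0, 1, 2, 3 again,
# so a label lookup with y_reset[...] coincides with a positional lookup.
y_reset = y.reset_index(drop=True)
selected = y_reset[positions]
```

Note that this discards the original labels, so it is only appropriate when they are not needed later.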
I've tried to recreate your situation.
I created the following DataFrame:
    col1  col2  col3
0      1     2     1
1      3     4     0
2      5     6     1
3      7     8     0
4      9    10     1
5     11    12     0
6     13    14     1
7     15    16     0
8     17    18     1
9     19    20     0
10    21    22     1
11    23    24     0
12    25    26     1
13    27    28     0
14    29    30     1
I set col1 and col2 for X and col3 for y. After this I converted X to a NumPy array as follows. The only difference is that I used shuffle in KFold.
X = df[['col1', 'col2']]
y = df['col3']
X = np.array(X)
kf = KFold(n_splits=3, shuffle=True)
for train_index, test_index in kf.split(X):
    X_train, y_train = X[train_index], y[train_index]
And it worked well. So please check my code and your code and clarify it if there is something I missed.
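One likely reason this scenario does not reproduce the error is that the DataFrame here keeps its default index, so integer labels and positions are the same thing. The sketch below contrasts that with a made-up shuffled index (the values and labels are hypothetical, not taken from the question):

```python
import numpy as np
import pandas as pd

# Positional indices, standing in for what KFold.split would yield.
positions = np.array([0, 2])

# Default RangeIndex (no split applied): labels equal positions,
# so y[...] and y.iloc[...] select the same rows.
y_default = pd.Series([1, 0, 1, 0])
same = y_default[positions].equals(y_default.iloc[positions])

# Hypothetical index left behind by a shuffling split:
# labels 0 and 2 do not exist, so the label lookup fails.
y_shuffled = pd.Series([1, 0, 1, 0], index=[9, 4, 7, 1])
try:
    y_shuffled[positions]
    raised = False
except KeyError:
    raised = True
```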
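I assume y2 is y. Since y is still a Series, y[train_index] looks rows up by label, and after train_test_split the labels no longer match the positions 0..n-1. You need to use .iloc to index it by position. The following code worked well.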
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold, train_test_split

data = {'col1':[30.5,45,1,99,6,5,4,2,5,7,7,3], 'col2':[99.5, 98, 95, 90,1,5,6,7,4,4,3,3],'col3':[23, 23.6, 3, 90,1,9,60,9,7,2,2,1]}
df = pd.DataFrame(data)
train, test = train_test_split(df, test_size=0.10)
X = train[['col1', 'col2']]
y = train['col3']
X = np.array(X)
kf = KFold(n_splits=3, shuffle=True)
for train_index, test_index in kf.split(X):
    # .iloc selects by position, which is what KFold.split returns
    X_train, y_train = X[train_index], y.iloc[train_index]
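As an alternative to .iloc, both X and y can be converted to NumPy arrays, so that every lookup is positional. This is only a sketch of that variation, not the approach from the answer above; the `random_state` arguments and the `fold_sizes` list are additions for reproducibility and illustration.

```python
import pandas as pd
from sklearn.model_selection import KFold, train_test_split

data = {'col1': [30.5, 45, 1, 99, 6, 5, 4, 2, 5, 7, 7, 3],
        'col2': [99.5, 98, 95, 90, 1, 5, 6, 7, 4, 4, 3, 3],
        'col3': [23, 23.6, 3, 90, 1, 9, 60, 9, 7, 2, 2, 1]}
df = pd.DataFrame(data)

# random_state is an assumption added here so the split is repeatable.
train, test = train_test_split(df, test_size=0.10, random_state=0)

# Converting both to NumPy arrays drops the pandas index entirely.
X = train[['col1', 'col2']].to_numpy()
y = train['col3'].to_numpy()

kf = KFold(n_splits=3, shuffle=True, random_state=0)
fold_sizes = []  # illustrative bookkeeping, not part of the original code
for train_index, test_index in kf.split(X):
    # Plain arrays are always indexed by position.
    X_train, y_train = X[train_index], y[train_index]
    fold_sizes.append((len(X_train), len(y_train)))
```

The trade-off is that the original row labels are lost, so .iloc is preferable when they are needed to trace rows back to df.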