
Pandas 'Passing list-likes to .loc or [] with any missing labels is no longer supported' on train_test_split returned data

For some reason, train_test_split triggers this error even though the lengths are identical and the indexes look the same.

import numpy as np
import pandas as pd
from sklearn.model_selection import KFold, train_test_split

data = {'col1':[30.5,45,1,99,6,5,4,2,5,7,7,3], 'col2':[99.5, 98, 95, 90,1,5,6,7,4,4,3,3],'col3':[23, 23.6, 3, 90,1,9,60,9,7,2,2,1]}
df = pd.DataFrame(data)

train, test = train_test_split(df, test_size=0.10)
X = train[['col1', 'col2']]
y2 = train['col3']

X = np.array(X)

kf = KFold(n_splits=3, shuffle=True)
for train_index, test_index in kf.split(X):
    X_train, y_train = X[train_index], y[train_index]  # this line raises the error

y is a pandas Series (same length as X). X was a DataFrame with about 20 numerical columns, cast to a NumPy array.

If I don't call train_test_split, it works fine.

The last line triggers the error; it comes from indexing this way: y[train_index]

asked by Danny W, Feb 28 '20 21:02


2 Answers

In case it helps anyone: I had the same problem when using the .groupby function on a DataFrame. I fixed it with:

df.reset_index(drop=True, inplace=True)
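
Applied to the question's setup, the same idea might look like the sketch below. It assumes the error comes from the shuffled index labels that train_test_split leaves on train. The random_state arguments are not in the original and are added only to make the run repeatable.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold, train_test_split

data = {
    'col1': [30.5, 45, 1, 99, 6, 5, 4, 2, 5, 7, 7, 3],
    'col2': [99.5, 98, 95, 90, 1, 5, 6, 7, 4, 4, 3, 3],
    'col3': [23, 23.6, 3, 90, 1, 9, 60, 9, 7, 2, 2, 1],
}
df = pd.DataFrame(data)

# random_state is an added assumption, used only for repeatability.
train, test = train_test_split(df, test_size=0.10, random_state=0)

# Replace the shuffled labels with 0..len(train)-1 so labels match positions.
train = train.reset_index(drop=True)

X = np.array(train[['col1', 'col2']])
y = train['col3']

kf = KFold(n_splits=3, shuffle=True, random_state=0)
for train_index, test_index in kf.split(X):
    # Label-based lookup is now safe: every position is also a label.
    X_train, y_train = X[train_index], y[train_index]
```

Assigning the result of reset_index back to train, rather than using inplace=True, avoids modifying an object that was derived from df.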
answered by tezzaaa, Oct 17 '22 13:10


I tried to reproduce your situation.

I created the following DataFrame:

    col1  col2  col3
0      1     2     1
1      3     4     0
2      5     6     1
3      7     8     0
4      9    10     1
5     11    12     0
6     13    14     1
7     15    16     0
8     17    18     1
9     19    20     0
10    21    22     1
11    23    24     0
12    25    26     1
13    27    28     0
14    29    30     1

I used col1 and col2 for X and col3 for y, then converted X to a NumPy array as follows. The only difference is that I used shuffle in KFold.

X = df[['col1', 'col2']]
y = df['col3']
X = np.array(X)
kf = KFold(n_splits=3, shuffle=True)
for train_index, test_index in kf.split(X):
    X_train, y_train = X[train_index], y[train_index]

This ran without errors. Please compare my code with yours and let me know if I missed something.
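
A likely reason this version runs: a freshly built DataFrame has a default RangeIndex, so the positions returned by kf.split are also valid labels. After train_test_split that no longer holds. The check below is a small sketch that rebuilds the 15-row frame from its visible pattern; random_state is an added assumption for repeatability.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    'col1': range(1, 30, 2),   # 1, 3, ..., 29
    'col2': range(2, 31, 2),   # 2, 4, ..., 30
    'col3': [1, 0] * 7 + [1],  # alternating 1, 0, ending on 1
})

# Fresh DataFrame: labels are 0..14, identical to positions.
labels_match_before = df.index.equals(pd.RangeIndex(len(df)))

train, test = train_test_split(df, test_size=0.10, random_state=0)

# train keeps the original labels of its rows, in shuffled order;
# the labels that went to test are absent from it.
absent_labels = sorted(set(df.index) - set(train.index))

print(labels_match_before)                  # True
print(list(train.index))                    # shuffled original labels
print(absent_labels == sorted(test.index))  # True
```

Any position in 0..len(train)-1 that happens to be one of the absent labels cannot be found by a label-based lookup such as y[train_index], and positions that do exist as labels select rows by label rather than by position.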

Update

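I assume y2 is meant to be y. In that case y is still a Series, and after train_test_split its index holds the shuffled original row labels rather than 0..n-1. y[train_index] looks values up by label, so the positions returned by kf.split may not exist as labels. Use .iloc to select by position instead. The following code worked: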

import numpy as np
import pandas as pd
from sklearn.model_selection import KFold, train_test_split

data = {'col1':[30.5,45,1,99,6,5,4,2,5,7,7,3], 'col2':[99.5, 98, 95, 90,1,5,6,7,4,4,3,3],'col3':[23, 23.6, 3, 90,1,9,60,9,7,2,2,1]}
df = pd.DataFrame(data)
train, test = train_test_split(df, test_size=0.10)

X = train[['col1', 'col2']]
y = train['col3']

X = np.array(X)

kf = KFold(n_splits=3, shuffle=True)
for train_index, test_index in kf.split(X):
    # .iloc selects by position, matching how X is indexed
    X_train, y_train = X[train_index], y.iloc[train_index]
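
An alternative sketch, not part of the original answer: convert y to a NumPy array as well, so that X and y are both indexed purely by position. The random_state arguments and the fold_sizes list are additions for illustration.

```python
import pandas as pd
from sklearn.model_selection import KFold, train_test_split

data = {
    'col1': [30.5, 45, 1, 99, 6, 5, 4, 2, 5, 7, 7, 3],
    'col2': [99.5, 98, 95, 90, 1, 5, 6, 7, 4, 4, 3, 3],
    'col3': [23, 23.6, 3, 90, 1, 9, 60, 9, 7, 2, 2, 1],
}
df = pd.DataFrame(data)

# random_state is an added assumption, used only for repeatability.
train, test = train_test_split(df, test_size=0.10, random_state=0)

# Both become plain arrays, so the pandas index no longer plays a role.
X = train[['col1', 'col2']].to_numpy()
y = train['col3'].to_numpy()

kf = KFold(n_splits=3, shuffle=True, random_state=0)
fold_sizes = []  # hypothetical helper list, only to inspect the split
for train_index, test_index in kf.split(X):
    X_train, y_train = X[train_index], y[train_index]
    X_test, y_test = X[test_index], y[test_index]
    fold_sizes.append((len(y_train), len(y_test)))
```

This drops the index labels entirely, so prefer .iloc when the original row labels are still needed later.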
answered by talatccan, Oct 17 '22 13:10