
sklearn KFold returning wrong indexes in Python

I am using the KFold class from the sklearn package in Python on a df (data frame) with non-contiguous row indexes.

This is the code:

kFold = KFold(n_splits=10, shuffle=True, random_state=None)
for train_index, test_index in kFold.split(dfNARemove):...

I get some train_index or test_index values that don't exist in my df.

What can I do?

HilaD asked Oct 08 '17 16:10



1 Answer

The KFold iterator yields positional indices of the train and validation rows of the DataFrame, not their non-contiguous index labels. You can access the train and validation rows with the pandas .iloc indexer:

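from sklearn.model_selection import KFold

kFold = KFold(n_splits=10, shuffle=True, random_state=None)
for train_index, test_index in kFold.split(dfNARemove):
    # train_index and test_index are row positions, so select rows with .iloc
    train_data = dfNARemove.iloc[train_index]
    test_data = dfNARemove.iloc[test_index]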
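
To see why the returned values can look like rows that "don't exist", here is a minimal sketch on a small made-up frame. The frame df and its labels are assumptions standing in for dfNARemove, and random_state=0 is chosen only so the run is repeatable. It checks that every value from split() is a position between 0 and len(df) - 1, and lists the positions that happen not to be row labels.

```python
import pandas as pd
from sklearn.model_selection import KFold

# Hypothetical stand-in for dfNARemove: row labels with gaps,
# as you would get after dropping rows from a larger frame.
df = pd.DataFrame({"x": range(6)}, index=[0, 2, 5, 9, 10, 17])

# Positions 0..5 that are not row labels; using them as labels would fail.
not_labels = sorted(set(range(len(df))) - set(df.index))

fold_sizes = []
kf = KFold(n_splits=3, shuffle=True, random_state=0)
for train_pos, test_pos in kf.split(df):
    # Every value is a position in 0..len(df)-1, whatever the labels are.
    assert train_pos.max() < len(df) and test_pos.max() < len(df)
    train_data = df.iloc[train_pos]  # positional lookup is always valid
    test_data = df.iloc[test_pos]
    fold_sizes.append((len(train_data), len(test_data)))
```

With these labels, positions 1, 3 and 4 are not present in df.index, which is the same mismatch described in the question.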

If you want to know which non-contiguous index labels correspond to train_index and test_index on each fold, you can do the following:

non_continuous_train_index = dfNARemove.index[train_index]
non_continuous_test_index = dfNARemove.index[test_index]
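
As a quick sanity check of that mapping, the sketch below (again on a made-up frame df, not the original dfNARemove) converts positions to labels on each fold, confirms that a label-based lookup with .loc returns the same rows as the positional .iloc lookup, and collects the test labels across folds.

```python
import pandas as pd
from sklearn.model_selection import KFold

# Hypothetical data with non-contiguous row labels.
df = pd.DataFrame({"x": range(6)}, index=[0, 2, 5, 9, 10, 17])

seen_test_labels = []
kf = KFold(n_splits=3, shuffle=True, random_state=0)
for train_pos, test_pos in kf.split(df):
    train_labels = df.index[train_pos]  # positions -> original row labels
    test_labels = df.index[test_pos]
    # Looking rows up by label gives the same rows as looking up by position.
    assert df.loc[test_labels].equals(df.iloc[test_pos])
    seen_test_labels.extend(test_labels)

# Each row label lands in exactly one test fold.
assert sorted(seen_test_labels) == sorted(df.index)
```

If the original labels are not needed afterwards, calling reset_index(drop=True) on the frame before splitting is another option, since labels and positions then coincide.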
Eduard Ilyasov answered Oct 23 '22 22:10