
sklearn KFold: access a single fold instead of using a for loop

After using cross_validation.KFold(n, n_folds=folds) I would like to access the training and testing indexes of a single fold, instead of going through all the folds.

So let's take the example code:

import numpy as np
from sklearn import cross_validation

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([1, 2, 3, 4])
kf = cross_validation.KFold(4, n_folds=2)

>>> print(kf)
sklearn.cross_validation.KFold(n=4, n_folds=2, shuffle=False,
                               random_state=None)
>>> for train_index, test_index in kf:
...     print("TRAIN:", train_index, "TEST:", test_index)

I would like to access the first fold in kf like this (instead of using a for loop):

train_index, test_index = kf[0]

This should return just the first fold, but instead I get the error: "TypeError: 'KFold' object does not support indexing"

What I want as output:

>>> train_index, test_index = kf[0]
>>> print("TRAIN:", train_index, "TEST:", test_index)
TRAIN: [2 3] TEST: [0 1]

Link: http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.KFold.html

Question

How do I retrieve the indexes for train and test for only a single fold, without going through the whole for loop?

asked Dec 09 '14 13:12 by NumesSanguis


1 Answer

You are on the right track. All you need to do now is:

kf = cross_validation.KFold(4, n_folds=2)
mylist = list(kf)
train, test = mylist[0]

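kf is an iterable that behaves like a generator: it doesn't compute each train-test split until it is needed, which is also why it cannot be indexed directly. This saves memory, as you are not storing items you don't need. Calling list() on the KFold object forces it to compute all the folds, which can then be indexed. If you only need the first fold, next(iter(kf)) returns it without building the whole list.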

Here are two great SO questions that explain what generators are: one and two
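To make the lazy behaviour concrete, here is a minimal pure-Python sketch. LazyFolds is a hypothetical stand-in written for illustration only; it is not sklearn's implementation, and it assumes the number of samples divides evenly into the folds. It shows why indexing fails while next(iter(...)) and list(...) both work.

```python
class LazyFolds:
    """Hypothetical stand-in for a KFold-like object (not sklearn's code)."""

    def __init__(self, n, n_folds):
        self.n = n
        self.n_folds = n_folds

    def __iter__(self):
        # Each (train, test) pair is built only when the consumer asks for it.
        indices = list(range(self.n))
        fold_size = self.n // self.n_folds  # assumes n is divisible by n_folds
        for k in range(self.n_folds):
            test = indices[k * fold_size:(k + 1) * fold_size]
            train = indices[:k * fold_size] + indices[(k + 1) * fold_size:]
            yield train, test


folds = LazyFolds(4, n_folds=2)

# Indexing fails because the class defines __iter__ but not __getitem__.
try:
    folds[0]
    indexing_works = True
except TypeError:
    indexing_works = False

# next(iter(...)) computes only the first fold.
first_train, first_test = next(iter(folds))

# list(...) computes every fold, after which ordinary indexing works.
all_folds = list(folds)
second_train, second_test = all_folds[1]
```

The trade-off is the one described above: list() pays for every fold up front, while next(iter(...)) stops after the first one.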


Edit Nov 2018

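The API has changed: sklearn.cross_validation was deprecated in sklearn 0.18 and removed in 0.20, and KFold now lives in sklearn.model_selection. It takes n_splits instead of n_folds, and the indices come from kf.split(X) rather than from iterating over kf itself. An updated example (for py3.6):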

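from sklearn.model_selection import KFold
import numpy as np

# n_splits=2 matches the original example (n_folds=2)
kf = KFold(n_splits=2)

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])

# split() returns a generator; next() takes only the first fold.
# The values are row indices into X, not the rows themselves.
train_index, test_index = next(kf.split(X))

In [12]: train_index
Out[12]: array([2, 3])

In [13]: test_index
Out[13]: array([0, 1])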
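
If you need a fold other than the first, one option is itertools.islice, which skips ahead in the generator without building a list. The sketch below assumes the sklearn.model_selection API; get_fold is a hypothetical helper name introduced here, not part of sklearn.

```python
from itertools import islice

import numpy as np
from sklearn.model_selection import KFold

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([1, 2, 3, 4])

kf = KFold(n_splits=2)  # shuffle=False by default, so the folds are deterministic


def get_fold(splitter, X, k):
    """Hypothetical helper: return (train_index, test_index) for fold k (0-based)."""
    # islice skips the first k folds and stops after the one we want.
    return next(islice(splitter.split(X), k, k + 1))


train_index, test_index = get_fold(kf, X, 1)

# The index arrays select rows of X and entries of y.
X_train, X_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]
```

Note that get_fold raises StopIteration if k is not smaller than the number of splits, so you may want to check k first.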
answered Sep 24 '22 18:09 by mbatchkarov