With sklearn, when you create a new KFold object with shuffle set to true, it produces different, newly randomized fold indices. However, every generator from a given KFold object gives the same indices for each fold, even when shuffle is true. Why does it work like this?
Example:
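import numpy as np
# Legacy API: sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn.cross_validation import KFold

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([1, 2, 3, 4])
kf = KFold(4, n_folds=2, shuffle=True)
# First pass over the folds
for fold in kf:
    print(fold)
print('---second round----')
# Second pass over the same KFold object
for fold in kf:
    print(fold)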
Output:
(array([2, 3]), array([0, 1]))
(array([0, 1]), array([2, 3]))
---second round----  # same indices for the folds
(array([2, 3]), array([0, 1]))
(array([0, 1]), array([2, 3]))
This question was motivated by a comment on this answer. I decided to split it into a new question to prevent that answer from becoming too long.
The general procedure for cross-validation typically starts by shuffling the dataset randomly. If the data is unordered in nature (i.e. not a time series), then shuffle=True is the right choice.
KFold Cross-Validation with Shuffle
In plain k-fold cross-validation, the dataset is divided into k folds in order. When shuffle and random_state are set in KFold, the samples are assigned to folds randomly, and reproducibly for a fixed seed:
kfs = KFold(n_splits=5, shuffle=True, random_state=2021)
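As a minimal sketch of what the fixed seed buys you, the snippet below iterates the same splitter twice and compares the test folds. It assumes the current sklearn.model_selection API (where folds come from kfs.split(X)); the array X is made-up data used only for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)  # made-up data: 10 samples, 2 features

kfs = KFold(n_splits=5, shuffle=True, random_state=2021)

# Two separate passes over the same splitter.
first_pass = [test.tolist() for _, test in kfs.split(X)]
second_pass = [test.tolist() for _, test in kfs.split(X)]

print(first_pass == second_pass)  # True: an integer seed makes the shuffle reproducible
```

Each sample index still lands in exactly one test fold; the seed only fixes which fold that is.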
Provides train/test indices to split data in train/test sets. Split dataset into k consecutive folds (without shuffling by default). Each fold is then used once as a validation while the k - 1 remaining folds form the training set.
Cross-validation then returns k different scores (for example, accuracy), one for each of the k held-out test folds.
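To make the "k scores" concrete, here is a hedged sketch that assumes the scores come from a helper such as cross_val_score in sklearn.model_selection. The iris dataset and the LogisticRegression model are illustrative choices, not something the text above prescribes.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Iris is stored sorted by class, so shuffling before splitting matters here.
X, y = load_iris(return_X_y=True)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
model = LogisticRegression(max_iter=1000)

# One score per fold; for a classifier the default metric is accuracy.
scores = cross_val_score(model, X, y, cv=cv)
print(len(scores))  # 5
```

Averaging scores gives a single summary number, while the spread across folds hints at how sensitive the model is to the particular split.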
A new iteration with the same KFold object will not reshuffle the indices; that only happens during instantiation of the object. KFold() never sees the data but knows the number of samples, so it uses that to shuffle the indices. From the code that runs during instantiation of KFold:
if shuffle:
    rng = check_random_state(self.random_state)
    rng.shuffle(self.idxs)
Each time a generator is created to iterate through the indices of each fold, it uses the same shuffled indices and divides them the same way.
Take a look at the code for the base class of KFold, _PartitionIterator(with_metaclass(ABCMeta)), where __iter__ is defined. The __iter__ method in the base class calls _iter_test_indices in KFold to divide and yield the train and test indices for each fold.
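The structure described above can be sketched roughly as follows. This is a simplified stand-in, not scikit-learn's actual source: the class names PartitionIteratorSketch and KFoldSketch are hypothetical, and np.random.RandomState and np.array_split are used in place of the library's own helpers. The point is only that the shuffle lives in __init__, while __iter__ reuses the stored permutation.

```python
import numpy as np


class PartitionIteratorSketch(object):
    """Simplified stand-in for the old base class (hypothetical name)."""

    def __init__(self, n):
        self.n = n

    def __iter__(self):
        # Turn the test indices yielded by the subclass into train/test arrays.
        ind = np.arange(self.n)
        for test_index in self._iter_test_indices():
            test_mask = np.zeros(self.n, dtype=bool)
            test_mask[test_index] = True
            yield ind[~test_mask], ind[test_mask]

    def _iter_test_indices(self):
        raise NotImplementedError


class KFoldSketch(PartitionIteratorSketch):
    """Simplified stand-in for the old KFold (hypothetical name)."""

    def __init__(self, n, n_folds=2, shuffle=False, random_state=None):
        super(KFoldSketch, self).__init__(n)
        self.n_folds = n_folds
        self.idxs = np.arange(n)
        if shuffle:
            # The only shuffle: performed once, when the object is created.
            rng = np.random.RandomState(random_state)
            rng.shuffle(self.idxs)

    def _iter_test_indices(self):
        # Slice the stored permutation; nothing is reshuffled here.
        for chunk in np.array_split(self.idxs, self.n_folds):
            yield chunk


kf = KFoldSketch(4, n_folds=2, shuffle=True)
first_round = [(train.tolist(), test.tolist()) for train, test in kf]
second_round = [(train.tolist(), test.tolist()) for train, test in kf]
print(first_round == second_round)  # True: both loops reuse the same self.idxs
```

Under this design, getting a fresh shuffle means constructing a new object, which matches the behaviour observed in the question.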