It seems like KFold generates the same values every time the object is iterated over, while ShuffleSplit generates different indices every time. Is this correct? If so, what are the uses for one over the other?
from sklearn import cross_validation  # legacy module, removed in scikit-learn 0.20

cv = cross_validation.KFold(10, n_folds=2, shuffle=True, random_state=None)
cv2 = cross_validation.ShuffleSplit(10, n_iter=2, test_size=0.5)

print(list(iter(cv)))
print(list(iter(cv)))
print(list(iter(cv2)))
print(list(iter(cv2)))
Yields the following output:
[(array([1, 3, 5, 8, 9]), array([0, 2, 4, 6, 7])), (array([0, 2, 4, 6, 7]), array([1, 3, 5, 8, 9]))]
[(array([1, 3, 5, 8, 9]), array([0, 2, 4, 6, 7])), (array([0, 2, 4, 6, 7]), array([1, 3, 5, 8, 9]))]
[(array([4, 6, 3, 2, 7]), array([8, 1, 9, 0, 5])), (array([3, 6, 7, 0, 5]), array([9, 1, 8, 4, 2]))]
[(array([3, 0, 2, 1, 7]), array([5, 6, 9, 4, 8])), (array([0, 7, 1, 3, 8]), array([6, 2, 5, 4, 9]))]
KFold provides train/test indices to split data into training and test sets. It splits the dataset into k consecutive folds (without shuffling by default). Each fold is then used as a validation set once, while the k - 1 remaining folds form the training set.
You need to know what "KFold" and "Stratified" are first. KFold is a cross-validator that divides the dataset into k folds. Stratification ensures that each fold of the dataset has the same proportion of observations with a given label.
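To make the stratification point concrete, here is a minimal sketch. It assumes the current sklearn.model_selection API rather than the legacy cross_validation module used in the question, and the toy labels are made up for illustration. With 8 samples of one class and 4 of another, each of the 4 test folds should keep the same 2:1 class ratio.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy data (made up for illustration): 8 samples of class 0, 4 of class 1.
X = np.zeros((12, 1))
y = np.array([0] * 8 + [1] * 4)

skf = StratifiedKFold(n_splits=4)

# Count how many samples of each class land in every test fold.
fold_class_counts = [
    np.bincount(y[test_idx], minlength=2).tolist()
    for _, test_idx in skf.split(X, y)
]

for counts in fold_class_counts:
    print(counts)  # [count of class 0, count of class 1] for one test fold
```

A plain KFold without shuffling on the same sorted labels would instead produce test folds containing only one class.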
How does the KFold with shuffle really work? Each time the KFold is called, it shuffles my indexes and it generates training/test data.
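A small sketch of how the seed affects shuffling, assuming the current sklearn.model_selection API (where the data is passed to split() instead of the constructor). With an integer random_state, every call to split() reproduces the same folds. With random_state=None, the shuffle draws from NumPy's global random state, so repeated calls are not guaranteed to match. The legacy cross_validation.KFold in the question appears to have shuffled once when the object was constructed, which would explain why iterating it twice printed identical folds.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10)


def splits_as_lists(cv, data):
    """Materialize the (train, test) index arrays as plain lists for comparison."""
    return [(train.tolist(), test.tolist()) for train, test in cv.split(data)]


# A fixed integer seed makes every call to split() reproduce the same folds.
seeded = KFold(n_splits=2, shuffle=True, random_state=42)
first_pass = splits_as_lists(seeded, X)
second_pass = splits_as_lists(seeded, X)
print(first_pass == second_pass)

# With random_state=None the shuffle uses NumPy's global random state,
# so separate calls to split() may or may not produce the same folds.
unseeded = KFold(n_splits=2, shuffle=True, random_state=None)
print(splits_as_lists(unseeded, X) == splits_as_lists(unseeded, X))
```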
Cross-validation is a resampling technique for estimating how well a model will perform on unseen data. In short, it is a model validation technique: build a number of train/test splits, measure the test accuracy on each split, and average the results.
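The split, score, and average loop can be sketched with cross_val_score. The iris dataset and the logistic regression model below are stand-ins chosen for illustration, not something the text above prescribes, and the modern sklearn.model_selection API is assumed.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Five train/test splits; shuffling matters here because iris is sorted by class.
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# One accuracy score per split, then the average across splits.
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
mean_accuracy = scores.mean()

print(scores)
print(mean_accuracy)
```

Any other splitter, such as ShuffleSplit, can be passed as cv in the same way.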
Difference in KFold and ShuffleSplit output
KFold will divide your data set into a prespecified number of folds, and every sample must be in one and only one fold. A fold is a subset of your dataset.
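As a quick check of the "one and only one fold" property, the sketch below tallies how often each sample index appears in a test fold. It assumes the modern sklearn.model_selection.KFold; the sample count and seed are arbitrary. Every count should come out as 1.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

# Tally how often each sample index shows up in a test fold.
test_counts = np.zeros(len(X), dtype=int)
for train_idx, test_idx in kf.split(X):
    test_counts[test_idx] += 1

print(test_counts.tolist())
```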
ShuffleSplit will randomly sample your entire dataset during each iteration to generate a training set and a test set. The test_size and train_size parameters control how large the test and training sets should be for each iteration. Since you are sampling from the entire dataset during each iteration, values selected during one iteration could be selected again during another iteration.
Summary: ShuffleSplit works iteratively, KFold just divides the dataset into k folds.
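The difference in mechanism can be sketched in plain Python. This is a simplified illustration, not scikit-learn's actual implementation: the function names kfold_indices and shuffle_split_indices are hypothetical, and the folds here are strided rather than consecutive.

```python
import random


def kfold_indices(n_samples, k, seed=None):
    """Shuffle once, cut into k disjoint folds, and use each fold as the test set once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def shuffle_split_indices(n_samples, n_iter, test_size, seed=None):
    """Draw a fresh random permutation on every iteration and cut it into test/train."""
    rng = random.Random(seed)
    n_test = int(round(test_size * n_samples))
    for _ in range(n_iter):
        permutation = rng.sample(range(n_samples), n_samples)
        yield permutation[n_test:], permutation[:n_test]


kfold_tests = [test for _, test in kfold_indices(10, k=5, seed=0)]
shuffle_tests = [
    test for _, test in shuffle_split_indices(10, n_iter=5, test_size=0.2, seed=0)
]

# K-fold test sets partition the data: every index appears exactly once.
print(sorted(idx for test in kfold_tests for idx in test))
# Shuffle-split test sets are drawn independently, so indices may repeat or be missing.
print(shuffle_tests)
```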
Difference when doing validation
In KFold, during each round you will use one fold as the test set and all the remaining folds as your training set. However, in ShuffleSplit, during each round n you should only use the training and test set from iteration n. As your data set grows, cross-validation time increases, making ShuffleSplit a more attractive alternative. If you can train your algorithm with a certain percentage of your data, as opposed to using all k-1 folds, ShuffleSplit is an attractive option.
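To illustrate the cost argument, the sketch below compares how many fits each splitter asks for and how much data each fit sees. It assumes the modern sklearn.model_selection API; the sample count, number of splits, and subset sizes are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, ShuffleSplit

X = np.arange(100)

# 10-fold CV: 10 model fits, each trained on 90 of the 100 samples.
kf = KFold(n_splits=10)
kfold_sizes = [(len(train), len(test)) for train, test in kf.split(X)]

# Shuffle-split: 3 fits, each trained on a random 30 samples and tested on another 20.
ss = ShuffleSplit(n_splits=3, train_size=30, test_size=20, random_state=0)
shuffle_sizes = [(len(train), len(test)) for train, test in ss.split(X)]

print(kfold_sizes)
print(shuffle_sizes)
```

With ShuffleSplit the number of iterations and the training-set size are chosen independently, whereas in KFold both follow from k.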