 

What's the difference between KFold and ShuffleSplit CV?

It seems like KFold generates the same indices every time the object is iterated over, while ShuffleSplit generates different indices every time. Is this correct? If so, what are the uses of one over the other?

from sklearn import cross_validation

cv = cross_validation.KFold(10, n_folds=2, shuffle=True, random_state=None)
cv2 = cross_validation.ShuffleSplit(10, n_iter=2, test_size=0.5)
print(list(iter(cv)))
print(list(iter(cv)))
print(list(iter(cv2)))
print(list(iter(cv2)))

Yields the following output:

[(array([1, 3, 5, 8, 9]), array([0, 2, 4, 6, 7])), (array([0, 2, 4, 6, 7]), array([1, 3, 5, 8, 9]))]
[(array([1, 3, 5, 8, 9]), array([0, 2, 4, 6, 7])), (array([0, 2, 4, 6, 7]), array([1, 3, 5, 8, 9]))]
[(array([4, 6, 3, 2, 7]), array([8, 1, 9, 0, 5])), (array([3, 6, 7, 0, 5]), array([9, 1, 8, 4, 2]))]
[(array([3, 0, 2, 1, 7]), array([5, 6, 9, 4, 8])), (array([0, 7, 1, 3, 8]), array([6, 2, 5, 4, 9]))]
rb612 asked Jan 11 '16 21:01

People also ask

What does KFold function do?

KFold will provide train/test indices to split data into train and test sets. It will split the dataset into k consecutive folds (without shuffling by default). Each fold is then used as a validation set once while the k - 1 remaining folds form the training set (source).
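The consecutive-folds idea can be sketched in a few lines of pure Python. This is a minimal illustration of the logic, not sklearn's implementation, and the function name is mine:

```python
# Minimal sketch of KFold's index logic (no shuffling): split n samples
# into k consecutive folds; each sample is in exactly one test fold.
def kfold_indices(n, k):
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        folds.append((train, test))
        start += size
    return folds

for train, test in kfold_indices(10, 2):
    print(train, test)
```

Note that the union of the test folds covers every index exactly once, which is the defining property of KFold.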

What is the difference between K-fold cross-validation and stratified k fold cross-validation?

You need to know what "KFold" and "Stratified" mean first. KFold is a cross-validator that divides the dataset into k folds. Stratification ensures that each fold of the dataset has the same proportion of observations for each label.
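One way to picture stratification: split each class's indices into k folds separately, so every fold inherits the class proportions. A hedged sketch of that idea (the function name is mine, and this is not how sklearn's StratifiedKFold is implemented internally):

```python
from collections import defaultdict

# Sketch of stratification: distribute each class's indices across the
# k test folds separately, so every fold keeps the class proportions.
def stratified_test_folds(y, k):
    by_class = defaultdict(list)
    for i, label in enumerate(y):
        by_class[label].append(i)
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        for j, i in enumerate(idxs):
            folds[j % k].append(i)
    return folds

y = [0] * 6 + [1] * 4            # 60% class 0, 40% class 1
for fold in stratified_test_folds(y, 2):
    print(fold, [y[i] for i in fold])
```

Each of the two folds ends up with three class-0 samples and two class-1 samples, matching the 60/40 split of the full dataset.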

What does shuffle in KFold do?

With shuffle=True, KFold shuffles the indices before splitting them into folds, so the folds are drawn randomly rather than as consecutive blocks of the dataset.

What is cross-validation medium?

Cross-validation is basically a resampling technique for estimating how well a model will perform on unseen data. In short: generate a bunch of train/test splits, measure accuracy on each split, and average the results.


1 Answer

Difference in KFold and ShuffleSplit output

KFold will divide your dataset into a prespecified number of folds, and every sample must be in one and only one fold. A fold is a subset of your dataset.

ShuffleSplit will randomly sample your entire dataset during each iteration to generate a training set and a test set. The test_size and train_size parameters control how large the test and training sets should be for each iteration. Since you are sampling from the entire dataset during each iteration, values selected during one iteration can be selected again during another iteration.
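The behavior described above can be sketched with the standard library alone. This is an assumption about the logic, not sklearn's code, and the function name is mine: every iteration reshuffles all indices, so the same sample can appear in the test set of several iterations.

```python
import random

# Sketch of ShuffleSplit's behavior: each iteration independently
# shuffles ALL indices and cuts off a fresh test set, so test sets
# from different iterations can overlap.
def shuffle_split(n, n_iter, test_size, seed=0):
    rng = random.Random(seed)
    n_test = int(n * test_size)
    for _ in range(n_iter):
        idx = list(range(n))
        rng.shuffle(idx)
        yield idx[n_test:], idx[:n_test]   # (train, test)

for train, test in shuffle_split(10, 2, 0.5):
    print(sorted(train), sorted(test))
```

Within any single iteration the train and test sets are still disjoint; the overlap only happens across iterations.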

Summary: ShuffleSplit draws a fresh random split on every iteration, while KFold just divides the dataset once into k folds.

Difference when doing validation

In KFold, during each round you use one fold as the test set and all the remaining folds as your training set. However, in ShuffleSplit, during each round n you should only use the training and test set from iteration n. As your dataset grows, cross-validation time increases, making ShuffleSplit a more attractive alternative. If you can train your algorithm with a certain percentage of your data, as opposed to using all k-1 folds, ShuffleSplit is an attractive option.
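The point about training-set size can be made concrete: with KFold the per-round training fraction is locked to (k - 1) / k, whereas ShuffleSplit's train_size is a free parameter, independent of how many iterations you run. A tiny illustration (the helper name is mine):

```python
# With KFold, the per-round training fraction is forced to (k - 1) / k.
# ShuffleSplit lets you pick, say, a 30% training set per iteration
# while still running as many iterations as you like.
def kfold_train_fraction(k):
    return (k - 1) / k

print(kfold_train_fraction(5))   # 0.8 -> every round trains on 80% of the data
```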

ilyas patanam answered Oct 19 '22 04:10