
How to use shuffle in KFold in scikit_learn

I am running 10-fold CV using the KFold function provided by scikit-learn in order to select some kernel parameters. I am implementing this (grid search) procedure:

1. Pick a selection of parameters
2. Generate an SVM
3. Generate a KFold
4. Get the data that corresponds to training/cv_test
5. Train the model (clf.fit)
6. Classify with the cv_test data
7. Calculate the cv-error
8. Repeat 1-7
9. When ready, pick the parameters that provide the lowest average cv-error
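The procedure above can be sketched roughly as follows. This is a minimal sketch, not the asker's actual code: the toy data, the RBF kernel, and the C grid are placeholder assumptions, and it uses the current `sklearn.model_selection` API rather than the `sklearn.cross_validation` module that existed in 2012.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import KFold

# Toy data as a stand-in for the real features/labels.
rng = np.random.RandomState(0)
X = rng.randn(100, 5)
y = (X[:, 0] > 0).astype(int)

best_params, best_error = None, np.inf
for C in [0.1, 1.0, 10.0]:                    # 1. pick a parameter value
    clf = SVC(kernel="rbf", C=C)              # 2. generate an SVM
    kf = KFold(n_splits=10)                   # 3. generate a KFold
    errors = []
    for train_idx, test_idx in kf.split(X):   # 4. training/cv_test indices
        clf.fit(X[train_idx], y[train_idx])   # 5. train the model
        pred = clf.predict(X[test_idx])       # 6. classify the cv_test data
        errors.append(np.mean(pred != y[test_idx]))  # 7. cv-error on this fold
    mean_error = np.mean(errors)              # 8. repeated for each C
    if mean_error < best_error:               # 9. keep the lowest average
        best_error, best_params = mean_error, {"C": C}
```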

If I do not use shuffle in the KFold generation, I get very much the same results for the average cv-error when I repeat the same runs, and the "best results" are repeatable. If I use shuffle, I get different values for the average cv-error when I repeat the same run several times, and the "best values" are not repeatable. I can understand that I should get different cv-errors for each KFold pass, but the final average should be the same.

How does KFold with shuffle really work? Each time KFold is called, it shuffles my indexes and generates training/test data. Does it pick the different folds for training/testing in a random way? In which situations is shuffle advantageous, and in which is it not?

Asked Sep 02 '12 by andreSmol

1 Answer

If shuffle is True, the whole dataset is first shuffled and then split into the K folds. For repeatable behavior, you can set random_state, for example to an integer seed (random_state=0). If your selected parameters depend on the shuffling, this means your parameter selection is very unstable. Probably you have very little training data, or you use too few folds (like 2 or 3).
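A small sketch of the random_state point, using the current `sklearn.model_selection.KFold` (the 2012-era `sklearn.cross_validation` module took the same shuffle/random_state arguments but has since been removed):

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10)

# Same integer seed: the shuffled folds come out identical on every run,
# so the averaged cv-error (and the "best" parameters) become repeatable.
kf1 = KFold(n_splits=5, shuffle=True, random_state=0)
kf2 = KFold(n_splits=5, shuffle=True, random_state=0)
folds1 = [test.tolist() for _, test in kf1.split(X)]
folds2 = [test.tolist() for _, test in kf2.split(X)]

# With shuffle=True and no random_state, the folds typically differ
# between runs, which explains the non-repeatable averages.
```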

The shuffle is mainly useful if your data is somehow sorted by class, because then each fold might contain only samples from one class (sorted classes are particularly dangerous for stochastic gradient descent classifiers). For other classifiers, it should make no difference. If the results are very unstable under shuffling, your parameter selection is likely to be uninformative (aka garbage).

Answered Oct 20 '22 by Andreas Mueller