What's the difference between KFold and ShuffleSplit CV?

It seems like KFold generates the same values every time the object is iterated over, while ShuffleSplit generates different indices every time. Is this correct? If so, when would you use one over the other?

    from sklearn import cross_validation  # legacy module, removed in scikit-learn 0.20

    cv = cross_validation.KFold(10, n_folds=2, shuffle=True, random_state=None)
    cv2 = cross_validation.ShuffleSplit(10, n_iter=2, test_size=0.5)

    print(list(iter(cv)))
    print(list(iter(cv)))
    print(list(iter(cv2)))
    print(list(iter(cv2)))

Yields the following output:

    [(array([1, 3, 5, 8, 9]), array([0, 2, 4, 6, 7])), (array([0, 2, 4, 6, 7]), array([1, 3, 5, 8, 9]))]
    [(array([1, 3, 5, 8, 9]), array([0, 2, 4, 6, 7])), (array([0, 2, 4, 6, 7]), array([1, 3, 5, 8, 9]))]
    [(array([4, 6, 3, 2, 7]), array([8, 1, 9, 0, 5])), (array([3, 6, 7, 0, 5]), array([9, 1, 8, 4, 2]))]
    [(array([3, 0, 2, 1, 7]), array([5, 6, 9, 4, 8])), (array([0, 7, 1, 3, 8]), array([6, 2, 5, 4, 9]))]

298

asked Jan 11 '16 21:01

rb612

1 Answer

Difference in KFold and ShuffleSplit output

KFold divides your dataset into a prespecified number of folds, and every sample must be in one and only one fold. A fold is a subset of your dataset.

ShuffleSplit randomly samples your entire dataset during each iteration to generate a training set and a test set. The test_size and train_size parameters control how large the test and training sets should be for each iteration. Since you are sampling from the entire dataset during each iteration, values selected during one iteration could be selected again during another iteration.

Summary: ShuffleSplit draws a new random train/test split on every iteration, while KFold just divides the dataset into k folds.
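The difference described above can be checked with a small sketch. Note the assumptions: it uses the newer sklearn.model_selection API (n_splits and .split(X)) rather than the legacy cross_validation module from the question, and the toy array X and the random_state values are chosen only for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, ShuffleSplit

# Toy dataset of 10 samples (illustrative only).
X = np.arange(10).reshape(-1, 1)

# KFold: the test folds form a partition of the dataset.
kf = KFold(n_splits=5, shuffle=True, random_state=0)
kf_test = [test for _, test in kf.split(X)]
kf_counts = np.bincount(np.concatenate(kf_test), minlength=10)

# ShuffleSplit: each iteration is an independent random draw,
# so a sample may appear in several test sets, or in none.
ss = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
ss_test = [test for _, test in ss.split(X)]
ss_counts = np.bincount(np.concatenate(ss_test), minlength=10)

print("KFold test appearances per sample:       ", kf_counts)
print("ShuffleSplit test appearances per sample:", ss_counts)
```

With KFold every sample is counted exactly once across the test folds. With ShuffleSplit the counts only have to add up to the total number of test slots (5 iterations of 2 samples here), so individual samples can be tested more than once or never.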

Difference when doing validation

In KFold, during each round you use one fold as the test set and all the remaining folds as your training set. In ShuffleSplit, by contrast, during round n you should use only the training and test sets from iteration n. As your dataset grows, cross-validation time increases, making ShuffleSplit a more attractive alternative. If you can train your algorithm with a certain percentage of your data as opposed to using all k-1 folds, ShuffleSplit is an attractive option.
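As a rough illustration of the cost argument, the sketch below compares the training and test set sizes produced by each splitter. It again assumes the newer sklearn.model_selection API. The dataset size, the number of splits, and the train_size and test_size values are hypothetical choices, not recommendations.

```python
import numpy as np
from sklearn.model_selection import KFold, ShuffleSplit

n_samples = 1000  # hypothetical dataset size
X = np.zeros((n_samples, 1))

# 10-fold CV: 10 rounds, each training on the other 9 folds.
kf = KFold(n_splits=10)
kf_sizes = [(len(train), len(test)) for train, test in kf.split(X)]

# ShuffleSplit: the number of rounds and both set sizes are chosen
# independently. Integer sizes are absolute sample counts.
ss = ShuffleSplit(n_splits=3, train_size=300, test_size=100, random_state=0)
ss_sizes = [(len(train), len(test)) for train, test in ss.split(X)]

print("KFold (train, test) sizes:       ", kf_sizes)
print("ShuffleSplit (train, test) sizes:", ss_sizes)
```

Here KFold fits ten models on 900 samples each, while this ShuffleSplit configuration fits three models on 300 samples each. The trade-off is that ShuffleSplit does not guarantee every sample is used for testing.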

108

answered Oct 19 '22 04:10

ilyas patanam

Tags: python, scipy, scikit-learn
