
What does KFold in python exactly do?

I am looking at this tutorial: https://www.dataquest.io/mission/74/getting-started-with-kaggle

I got to part 9, making predictions. In there there is some data in a dataframe called titanic, which is then divided up in folds using:

# Generate cross-validation folds for the titanic dataset. It returns the row indices corresponding to train and test.
# We set random_state to ensure we get the same splits every time we run this.
kf = KFold(titanic.shape[0], n_folds=3, random_state=1)

I am not sure what it is exactly doing or what kind of object kf is. I tried reading the documentation, but it did not help much. Also, there are three folds (n_folds=3), so why is the code below only accessing train and test (and how do I know they are called train and test)?

for train, test in kf:
asked Mar 17 '16 by user
People also ask

How does KFold work in Python?

KFold divides all the samples into k groups of samples, called folds (if k = n, this is equivalent to the Leave-One-Out strategy), of equal size (if possible). The prediction function is learned using k - 1 folds, and the fold left out is used for testing.

Why is KFold used?

Cross-validation is usually used in machine learning to improve model evaluation when we don't have enough data to apply other methods, such as a 3-way split (train, validation and test) or a separate holdout dataset.

What is KFold method?

What is K-Fold? K-Fold is a validation technique in which we split the data into k subsets and repeat the holdout method k times, where each of the k subsets is used once as the test set and the other k - 1 subsets are used for training.

What does KFold split () return?

KFold.split() does not return scores; it yields k pairs of index arrays, (train_indices, test_indices), one pair per fold. Per-fold scores come from fitting and evaluating a model on each of those splits (for example via cross_val_score).
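A minimal sketch (using the current sklearn.model_selection API) of what split() actually yields:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(6)  # 6 samples; only the number of rows matters for the split
kf = KFold(n_splits=3)

# Each iteration yields a (train_indices, test_indices) pair of index arrays
for train_index, test_index in kf.split(X):
    print(train_index, test_index)
# First pair: [2 3 4 5] [0 1]
```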


2 Answers

KFold provides train/test indices to split data into train and test sets. It splits the dataset into k consecutive folds (without shuffling by default). Each fold is then used as a validation set once, while the k - 1 remaining folds form the training set (source).

Let's say you have data indices from 0 to 11. If you use n_folds=k, then in the i-th iteration (i ≤ k) you get the i-th fold as the test indices and the remaining k - 1 folds together as the train indices.

An example

import numpy as np
from sklearn.cross_validation import KFold

x = [1,2,3,4,5,6,7,8,9,10,11,12]
kf = KFold(12, n_folds=3)

# Iterating over kf yields (train_indices, test_indices) pairs
for train_index, test_index in kf:
    print(train_index, test_index)

Output

Fold 1: [ 4  5  6  7  8  9 10 11] [0 1 2 3]
Fold 2: [ 0  1  2  3  8  9 10 11] [4 5 6 7]
Fold 3: [0 1 2 3 4 5 6 7] [ 8  9 10 11]

Import update for newer sklearn versions:

KFold was moved to the sklearn.model_selection module in version 0.18, and the old sklearn.cross_validation module was removed in 0.20. To import KFold in sklearn 0.18+, use from sklearn.model_selection import KFold. Note that the constructor also changed: you no longer pass the number of samples, n_folds was renamed to n_splits, and you iterate over kf.split(X) instead of over kf itself. KFold current documentation source
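A sketch of the same 12-sample example with the newer API (assuming sklearn 0.18+), which produces the same three folds:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(12)   # 12 samples; only the length matters for the split
kf = KFold(n_splits=3)  # n_folds was renamed to n_splits

# split() yields (train_indices, test_indices) pairs, one per fold
for train_index, test_index in kf.split(X):
    print(train_index, test_index)
```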

answered Oct 24 '22 by Quazi Marufur Rahman


Sharing some theoretical information about K-fold that I have learnt so far.

K-fold is a model validation technique; it does not use your pre-trained model. Rather, it takes the hyper-parameters and trains a new model on k - 1 subsets of the data, then tests that model on the k-th subset.

The k different models are used only for validation.

It returns k different scores (e.g. accuracy percentages), one per test fold. We generally take their average to analyse the model.

We repeat this process for each of the models we want to analyse. Brief algorithm:

  1. Split the data into training and test parts.
  2. Train different models, say SVM, RF and LR, on this training data.
   2.a Take the whole training set and divide it into k folds.
   2.b Create a new model with the hyper-parameters obtained in step 2.
   2.c Fit the newly created model on k - 1 folds.
   2.d Test it on the k-th fold.
   2.e Take the average score.
  3. Compare the average scores and select the best model out of SVM, RF and LR.
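The algorithm above can be sketched with sklearn's cross_val_score helper. The dataset and hyper-parameters here are illustrative assumptions, not from the answer:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)  # stand-in dataset

# Candidate models (SVM, RF, LR) with example hyper-parameters
models = {
    "SVM": SVC(),
    "RF": RandomForestClassifier(n_estimators=50, random_state=1),
    "LR": LogisticRegression(max_iter=1000),
}

# cross_val_score handles steps 2.a-2.d internally: it clones the model,
# fits it on k - 1 folds and scores it on the held-out fold, k times
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(name, round(scores.mean(), 3))  # step 2.e: average score
```

The model with the highest average score would be selected in step 3.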

The simple reason for doing this: we generally have a data deficiency, and if we divide the whole dataset into:

  1. Training
  2. Validation
  3. Testing

we may be left with relatively small chunks of data, which may cause our model to overfit. It is also possible that some of the data remains untouched during training, so we never analyse the model's behaviour against it.

K-fold cross-validation overcomes both of these issues.

answered Oct 24 '22 by vipin bansal