
Need help understanding cross_val_score in sklearn python

I am trying to implement K-fold cross-validation for classification using sklearn in Python. I understand the basic concept behind K-fold cross-validation. However, I don't understand what cross_val_score does, or what role the CV iterations play in producing the array of scores it returns. Below is an example from the official sklearn documentation.

**Example 1**
from sklearn import datasets, linear_model
from sklearn.model_selection import cross_val_score
diabetes = datasets.load_diabetes()
X = diabetes.data[:150]
y = diabetes.target[:150]
lasso = linear_model.Lasso()
print(cross_val_score(lasso, X, y, cv=3))  
***OUTPUT***
[0.33150734 0.08022311 0.03531764]

Looking at Example 1, the output is an array of 3 values. I know that when we use KFold, n_splits is the parameter that sets the number of folds. So what does cv do in this example?

**My Code**
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.naive_bayes import BernoulliNB

# random_state is omitted because it only applies when shuffle=True
kf = KFold(n_splits=4, shuffle=False)
print('Get_n_splits', kf.get_n_splits(X), '\n\n')
for train_index, test_index in kf.split(X):
    print('TRAIN:', train_index, 'TEST:', test_index)
    x_train, x_test = df.iloc[train_index], df.iloc[test_index]
    # assumes y is a pandas Series aligned with df
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

print('\n\n')

# use train_test_split to split into training and testing data
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# fit / train the model using the training data
clf = BernoulliNB()
model = clf.fit(x_train, y_train)
y_predicted = clf.predict(x_test)

# cross_val_score refits a clone of the estimator on each of the 4 folds
scores = cross_val_score(model, df, y, cv=4)
print('\n\n')
print('Bernoulli Naive Bayes Classification Cross-validated Scores:', scores)
print('\n\n')

In My Code, I am using 4-fold cross-validation for a Bernoulli Naive Bayes classifier, and I pass cv=4 to cross_val_score:

scores = cross_val_score(model, df, y, cv=4)

This line gives me an array of 4 values. However, if I change it to cv=8:

scores = cross_val_score(model, df, y, cv=8)

then an array of 8 values is generated as output. So again, what does cv do here?

I have read the documentation over and over and searched numerous websites, but since I am a newbie, I really don't understand what cv does and how the scores are generated.

Any and all help would be really appreciated.

Thanks in advance

Stevi G asked Oct 02 '18 15:10


People also ask

What is cross_val_score in sklearn?

cross_val_score is a function in the scikit-learn package that trains and tests a model over multiple folds of your dataset. This cross-validation method gives you a better picture of model performance over the whole dataset than a single train/test split does.

How is cross_val_score calculated?

cross_val_score splits the data into, say, 5 folds. For each fold, it fits the model on the other 4 folds and scores it on the held-out fold. It then returns the 5 scores, from which you can calculate a mean and variance. Cross-validation is used both to tune parameters and to get an estimate of the score.

What is the difference between KFold and cross_val_score?

cross_val_score is a function that evaluates an estimator on data and returns the scores. KFold, on the other hand, is a class that lets you split your data into K folds.
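A small sketch of that difference, using a made-up six-sample array (the name X_toy is purely illustrative): KFold only produces train/test index arrays, and it is up to the caller, or to cross_val_score, to fit and score a model on them.

```python
import numpy as np
from sklearn.model_selection import KFold

X_toy = np.arange(12).reshape(6, 2)  # six illustrative samples, two features

kf = KFold(n_splits=3)               # no shuffling: test folds are contiguous blocks
folds = [(train.tolist(), test.tolist()) for train, test in kf.split(X_toy)]
for train, test in folds:
    print("TRAIN:", train, "TEST:", test)
# TRAIN: [2, 3, 4, 5] TEST: [0, 1]
# TRAIN: [0, 1, 4, 5] TEST: [2, 3]
# TRAIN: [0, 1, 2, 3] TEST: [4, 5]
```

Each sample lands in the test set exactly once, which is why K folds lead to K scores.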

What is cross_val_score used for?

The cross_val_score() function performs the evaluation: it takes the estimator, the dataset and the cross-validation configuration, and returns an array of scores, one for each fold.


1 Answer

In K-fold cross-validation, the data is split into K folds and the following procedure is applied:

  1. Model is trained using K-1 of the folds as training data
  2. Resulting Model is validated on the remaining data

This process is repeated K times, each time holding out a different fold, and a performance measure such as accuracy is computed at each step.
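The loop described above can be written out by hand. This is a minimal sketch, not scikit-learn's actual implementation; it reuses the diabetes subset and Lasso model from Example 1 in the question, and the helper name manual_cv_scores is made up for illustration. For a regressor, an integer cv corresponds to an unshuffled KFold, so both approaches should produce the same scores.

```python
import numpy as np
from sklearn import datasets, linear_model
from sklearn.base import clone
from sklearn.model_selection import KFold, cross_val_score


def manual_cv_scores(estimator, X, y, n_splits):
    """Hypothetical helper: K-fold scoring written as an explicit loop."""
    scores = []
    for train_idx, test_idx in KFold(n_splits=n_splits).split(X):
        model = clone(estimator)                 # fresh, unfitted copy per fold
        model.fit(X[train_idx], y[train_idx])    # train on the other K-1 folds
        scores.append(model.score(X[test_idx], y[test_idx]))  # score held-out fold
    return np.array(scores)


diabetes = datasets.load_diabetes()
X, y = diabetes.data[:150], diabetes.target[:150]
lasso = linear_model.Lasso()

manual = manual_cv_scores(lasso, X, y, n_splits=3)
auto = cross_val_score(lasso, X, y, cv=3)
print(manual)
print(auto)
```

Both arrays hold one score per fold, so cv=3 yields 3 values.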

The diagram below, taken from the cross-validation module of the scikit-learn documentation, gives a clear picture.

[Image: Cross Validation]

>>> from sklearn import datasets, svm
>>> from sklearn.model_selection import cross_val_score
>>> iris = datasets.load_iris()
>>> clf = svm.SVC(kernel='linear', C=1)
>>> scores = cross_val_score(clf, iris.data, iris.target, cv=5)
>>> scores
array([0.96..., 1.  ..., 0.96..., 0.96..., 1.        ])
>>> print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
Accuracy: 0.98 (+/- 0.03)

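Here cv=5 tells cross_val_score to split the data into 5 folds, so it returns 5 scores, one per fold; cv=4 would give 4 scores and cv=8 would give 8. The mean of these scores is then reported as a single summary figure. By default, the score computed at each CV iteration is the one returned by the estimator's score method (accuracy for classifiers).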
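To make the role of cv concrete, here is a small sketch on the same iris/SVC setup. It relies on two documented behaviours: for a classifier, an integer cv is treated as a StratifiedKFold with that many splits, and the default score for a classifier is accuracy. The variable names are illustrative only.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import StratifiedKFold, cross_val_score

iris = datasets.load_iris()
clf = svm.SVC(kernel="linear", C=1)

# The number of scores returned matches the number of folds requested.
for k in (4, 5, 8):
    scores = cross_val_score(clf, iris.data, iris.target, cv=k)
    print(k, len(scores), round(scores.mean(), 3))

# An integer cv and an explicit splitter object describe the same folds here.
by_int = cross_val_score(clf, iris.data, iris.target, cv=5)
by_splitter = cross_val_score(
    clf, iris.data, iris.target, cv=StratifiedKFold(n_splits=5)
)

# Asking for accuracy explicitly matches the default classifier score.
by_accuracy = cross_val_score(clf, iris.data, iris.target, cv=5, scoring="accuracy")
```

Passing a splitter object instead of an integer is the way to control details such as shuffling while still letting cross_val_score run the fit-and-score loop.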

I took help from the links below.

  1. https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html#sklearn.model_selection.cross_val_score

  2. https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation

kamranisg answered Oct 17 '22 00:10