 

Scikit-learn cross validation scoring for regression

How can one use cross_val_score for regression? The default scoring seems to be accuracy, which is not very meaningful for regression. Say I would like to use mean squared error; is it possible to specify that in cross_val_score?

I tried the following two, but neither works:

scores = cross_validation.cross_val_score(svr, diabetes.data, diabetes.target, cv=5, scoring='mean_squared_error')  

and

scores = cross_validation.cross_val_score(svr, diabetes.data, diabetes.target, cv=5, scoring=metrics.mean_squared_error) 

The first one generates a list of negative numbers, while mean squared error should always be non-negative. The second one complains that:

mean_squared_error() takes exactly 2 arguments (3 given) 
asked Jun 10 '14 by clwen

People also ask

Can cross-validation be used for regression?

Common methods include the validation set approach, leave-one-out cross-validation, k-fold cross-validation, and repeated k-fold cross-validation. The (repeated) k-fold approach is generally recommended for estimating the prediction error rate, and it can be used in both regression and classification settings.
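For illustration, a minimal sketch of repeated k-fold cross-validation in a regression setting, assuming a recent scikit-learn (RepeatedKFold lives in sklearn.model_selection) and the diabetes dataset:

from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score

X, y = load_diabetes(return_X_y=True)

# 5 folds, repeated 3 times with different random shuffles
cv = RepeatedKFold(n_splits=5, n_repeats=3, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=cv, scoring='neg_mean_squared_error')
print(-scores.mean())  # flip the sign back to report a positive MSE estimate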

What is cross_val_score in sklearn?

cross_val_score is a function in the scikit-learn package which trains and tests a model over multiple folds of your dataset. This cross-validation method gives you a better understanding of model performance over the whole dataset instead of just a single train/test split.
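To make that concrete, a rough sketch (assuming a recent scikit-learn and the diabetes dataset) comparing one score from a single split against the per-fold scores returned by cross_val_score:

from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_diabetes(return_X_y=True)
model = LinearRegression()

# A single train/test split yields one R^2 score
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
single_score = model.fit(X_tr, y_tr).score(X_te, y_te)

# cross_val_score yields one R^2 score per fold (the default scoring for regressors)
cv_scores = cross_val_score(model, X, y, cv=5)
print(single_score, cv_scores.mean(), cv_scores.std())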

How do you use cross-validation in scikit-learn?

The simplest way to use cross-validation is to call the cross_val_score helper function on the estimator and the dataset, e.g. from sklearn.model_selection import cross_val_score followed by building an estimator such as clf = svm.SVC(...).
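The original snippet is cut off; a fuller version along the lines of the scikit-learn user guide example (assuming the iris dataset, as the docs use) might look like:

from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score

X, y = datasets.load_iris(return_X_y=True)
clf = svm.SVC(kernel='linear', C=1)

# One accuracy score per fold for 5-fold cross-validation
scores = cross_val_score(clf, X, y, cv=5)
print(scores)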


2 Answers

I don't have the reputation to comment, but I want to provide this link for you and/or any passers-by, where the negative output of the MSE in scikit-learn is discussed: https://github.com/scikit-learn/scikit-learn/issues/2439

In addition (to make this a real answer), your first option is correct: not only is MSE the metric you want to use to compare models, but R^2 cannot even be calculated for some types of cross-validation (with leave-one-out, for instance, each test fold contains a single sample, so R^2 is undefined).

If you choose MSE as the scorer, it outputs an array of errors which you can then take the mean of, like so:

# Doing linear regression with leave one out cross val
from sklearn import cross_validation, linear_model
import numpy as np

# Including this to remind you that it is necessary to use numpy arrays rather
# than lists otherwise you will get an error
X_digits = np.array(x)
Y_digits = np.array(y)

loo = cross_validation.LeaveOneOut(len(Y_digits))

regr = linear_model.LinearRegression()

scores = cross_validation.cross_val_score(regr, X_digits, Y_digits, scoring='mean_squared_error', cv=loo)

# This will print the mean of the list of errors that were output and
# provide your metric for evaluation
print scores.mean()
answered by Sirrah
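For readers on a newer scikit-learn: the cross_validation module used above was later removed and the scorer name changed, so here is a rough sketch of the same leave-one-out idea with the current model_selection API and the diabetes data (an assumption-laden update, not the original answerer's code):

from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_diabetes(return_X_y=True)
regr = LinearRegression()

# 'mean_squared_error' was renamed 'neg_mean_squared_error' in later releases
scores = cross_val_score(regr, X, y, scoring='neg_mean_squared_error', cv=LeaveOneOut())
print(-scores.mean())  # flip the sign back to report a positive MSE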


The first one is correct. It outputs the negative of the MSE, as it always tries to maximize the score. Please help us by suggesting an improvement to the documentation.

answered by Andreas Mueller
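As a small follow-up illustrating that point with the question's own call (this reuses that era's sklearn.cross_validation module, which was removed in later releases; just flip the sign to read the values as ordinary MSE):

from sklearn import cross_validation, datasets, svm

diabetes = datasets.load_diabetes()
svr = svm.SVR()

# cross_val_score always maximizes, so the 'mean_squared_error' scorer returns negated values
scores = cross_validation.cross_val_score(svr, diabetes.data, diabetes.target, cv=5, scoring='mean_squared_error')
print(-scores)           # positive MSE per fold
print((-scores).mean())  # averaged across folds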