
"scoring must return a number" cross_val_score error in scikit-learn

Maybe it is a dumb question, but I don't understand the error that the function cross_val_score gives me in the code below. Perhaps the answer lies in the format of the X sample, since that is what the crash message showed, but I don't know how to fix it. This is a piece of code from my project, with some random values.

import numpy as np
from sklearn import mixture,cross_validation

np.random.seed(0)
n_samples = 300
C = np.array([[0., -0.7], [3.5, .7]])
X = np.r_[np.dot(np.random.randn(n_samples, 2), C),
          np.random.randn(n_samples, 2) + np.array([20, 20])]

clf = mixture.GMM(n_components=2, covariance_type='full')
score = cross_validation.cross_val_score(clf, X)

This gives me the error:

ValueError: scoring must return a number, got (<type 'numpy.ndarray'>) instead
Mike asked Apr 20 '15 17:04



1 Answer

I think this may be an issue in scikit-learn. cross_val_score ultimately calls the score method of whatever estimator is passed to it. Typically, score (e.g. in KMeans) returns a float, so when a KMeans estimator is passed to cross_val_score, all is well:

>>> from sklearn import cluster
>>> clf = cluster.KMeans()
>>> score = cross_validation.cross_val_score(clf, X)
# (no error)

Note the return type of score:

>>> clf = cluster.KMeans()
>>> clf.fit(X)
>>> type(clf.score(X))
numpy.float64

When score is called on a GMM, an array is returned instead (one log probability per sample):

>>> clf = mixture.GMM()
>>> clf.fit(X)
>>> type(clf.score(X))
numpy.ndarray

Because cross_val_score relies on clf.score() returning a single number, the error message you see makes sense.
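
To make the failure concrete, here is a minimal sketch of the kind of check involved. It is an illustration, not scikit-learn's actual source: check_score, FloatScoreEstimator and ArrayScoreEstimator are hypothetical names, and plain lists stand in for NumPy arrays so that it runs with the standard library alone.

```python
import numbers


def check_score(estimator, data):
    """Call estimator.score(data) and insist on a single number (hypothetical helper)."""
    score = estimator.score(data)
    if not isinstance(score, numbers.Number):
        raise ValueError(
            "scoring must return a number, got (%s) instead" % type(score)
        )
    return score


class FloatScoreEstimator:
    """Stand-in for an estimator whose score() returns one float, like KMeans."""

    def score(self, data):
        return -float(sum(x * x for x in data))


class ArrayScoreEstimator:
    """Stand-in for an estimator whose score() returns one value per sample, like GMM."""

    def score(self, data):
        return [-float(x * x) for x in data]


samples = [1.0, 2.0, 3.0]
print(check_score(FloatScoreEstimator(), samples))  # a single float passes the check

try:
    check_score(ArrayScoreEstimator(), samples)
except ValueError as err:
    print(err)  # a per-sample sequence is rejected
```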

A workaround is to supply cross_val_score with your own scorer. For example, to take the average of the scores returned by GMM.score(), create this scoring function:

>>> scorer = lambda est, data: np.mean(est.score(data))

Then you can pass this scorer as an argument to cross_val_score:

>>> score = cross_validation.cross_val_score(clf, X, scoring=scorer)

This avoids the error, and I think it should more or less do what you're looking for. I'm not sure the mean is necessarily the best way to summarize the scores, though it seems reasonable enough. From here you can define your own method.
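
As a sketch of defining your own method, the reduction can be made a parameter. Everything here is illustrative: make_scorer_with and PerSampleEstimator are hypothetical names (not part of scikit-learn), and the stub's scores are made-up numbers standing in for the per-sample values that GMM.score() returns.

```python
import statistics


def make_scorer_with(reduce_fn):
    """Build a scorer(est, data) that collapses per-sample scores to one float."""
    def scorer(est, data):
        return float(reduce_fn(est.score(data)))
    return scorer


class PerSampleEstimator:
    """Stub estimator: score() returns one made-up value per sample."""

    def score(self, data):
        return [-0.5 * x for x in data]


est = PerSampleEstimator()
data = [1.0, 2.0, 3.0, 10.0]

mean_scorer = make_scorer_with(statistics.mean)      # average per-sample score
total_scorer = make_scorer_with(sum)                 # total over the fold
median_scorer = make_scorer_with(statistics.median)  # less sensitive to outliers

print(mean_scorer(est, data), total_scorer(est, data), median_scorer(est, data))
```

Any of these callables has the same (estimator, data) signature as the lambda above, so it could be passed as scoring= in the same way. Note that a sum grows with the size of the test fold, whereas a mean or median does not.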

mattsilver answered Oct 20 '22 21:10
