
sklearn - Cross validation with multiple scores

I would like to compute the recall, precision and F-measure of a cross-validation test for different classifiers. scikit-learn comes with cross_val_score, but unfortunately that function does not return multiple values.

I could compute these measures by calling cross_val_score three times (sketched below), but that is not efficient. Is there any better solution?
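For reference, a minimal sketch of that naive approach (assuming a classifier clf and data X, y are already defined, using the pre-0.18 module layout from the time of the question); each call refits the model on every fold, so the work is tripled:

    from sklearn.cross_validation import cross_val_score  # pre-0.18 module layout

    recall    = cross_val_score(clf, X, y, cv=5, scoring='recall')
    precision = cross_val_score(clf, X, y, cv=5, scoring='precision')
    f1        = cross_val_score(clf, X, y, cv=5, scoring='f1')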

For now I have written this function:

    import numpy as np
    from sklearn import metrics

    def mean_scores(X, y, clf, skf):
        # Accumulate the flattened confusion matrix over all folds.
        cm = np.zeros(len(np.unique(y)) ** 2)
        for train, test in skf:
            clf.fit(X[train], y[train])
            y_pred = clf.predict(X[test])
            cm += metrics.confusion_matrix(y[test], y_pred).flatten()

        # Average over folds and unpack in sklearn's flatten order,
        # which for binary labels is (tn, fp, fn, tp).
        return compute_measures(*cm / skf.n_folds)

    def compute_measures(tn, fp, fn, tp):
        """Computes effectiveness measures given a flattened binary confusion matrix."""
        specificity = tn / (tn + fp)
        sensitivity = tp / (tp + fn)
        fmeasure = 2 * (specificity * sensitivity) / (specificity + sensitivity)
        return sensitivity, specificity, fmeasure

It basically sums up the confusion matrix values across the folds; once you have the false positives, false negatives and so on, you can easily compute recall, precision, etc. (a usage sketch follows below). But I still don't like this solution :)
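For context, a minimal sketch of how this helper would be driven with the pre-0.18 cross-validation API (where StratifiedKFold takes the labels at construction time); the dataset is just an illustration:

    from sklearn.cross_validation import StratifiedKFold
    from sklearn.datasets import load_iris
    from sklearn.svm import SVC

    iris = load_iris()
    X, y = iris.data, iris.target == 1   # binarize to get a two-class problem

    skf = StratifiedKFold(y, n_folds=5)  # old API: labels passed directly
    clf = SVC(kernel='linear', C=1)

    sensitivity, specificity, fmeasure = mean_scores(X, y, clf, skf)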

asked Apr 28 '14 by blueSurfer




2 Answers

Now in scikit-learn: cross_validate is a new function that can evaluate a model on multiple metrics. This feature is also available in GridSearchCV and RandomizedSearchCV. It was recently merged into master and will be available in v0.19.

From the scikit-learn doc:

The cross_validate function differs from cross_val_score in two ways:

1. It allows specifying multiple metrics for evaluation.
2. It returns a dict containing training scores, fit-times and score-times in addition to the test score.

The typical use case looks like this:

    from sklearn.svm import SVC
    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_validate

    iris = load_iris()
    scoring = ['precision', 'recall', 'f1']
    clf = SVC(kernel='linear', C=1, random_state=0)
    scores = cross_validate(clf, iris.data, iris.target == 1, cv=5,
                            scoring=scoring, return_train_score=False)

See also the multi-metric evaluation example in the scikit-learn documentation.
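The returned value is a dict with one entry per timing and per metric (keys follow the test_<scorer> convention); a minimal sketch of how to summarize it:

    # Each entry is an array with one score per fold.
    print(sorted(scores.keys()))
    # ['fit_time', 'score_time', 'test_f1', 'test_precision', 'test_recall']
    print(scores['test_recall'].mean())   # mean recall over the 5 folds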

answered Sep 16 '22 by TomDLT


The solution you present is essentially a hand-rolled version of what cross_val_score does, adapted to your situation. It seems like the right way to go.

cross_val_score takes the argument n_jobs=, making the evaluation parallelizable. If this is something you need, you should look into replacing your for loop with a parallel loop, using sklearn.externals.joblib.Parallel, as sketched below.
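A minimal sketch of that idea (the helper name _fit_and_score_fold is made up for illustration; clf, X, y and skf are as in your code):

    import numpy as np
    from sklearn import metrics
    from sklearn.base import clone
    from sklearn.externals.joblib import Parallel, delayed

    def _fit_and_score_fold(clf, X, y, train, test):
        """Fit a fresh clone on one fold, return the flattened confusion matrix."""
        clf = clone(clf)  # do not share fitted state between workers
        clf.fit(X[train], y[train])
        y_pred = clf.predict(X[test])
        return metrics.confusion_matrix(y[test], y_pred).flatten()

    # One task per fold, spread over 4 processes.
    cms = Parallel(n_jobs=4)(
        delayed(_fit_and_score_fold)(clf, X, y, train, test)
        for train, test in skf)
    cm = np.sum(cms, axis=0)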

On a more general note, a discussion is going on about the problem of multiple scores in the scikit-learn issue tracker. So while it looks like future versions of scikit-learn will permit multiple outputs of scorers, as of now this is impossible.

A hacky (disclaimer!) way to get around this is to change the code in cross_validation.py ever so slightly, by removing a condition check on whether your score is a number. However, this suggestion is very version dependent, so I will present it for version 0.14.

1) In IPython, type from sklearn import cross_validation, followed by cross_validation??. Note the filename that is displayed and open it in an editor (you may need root privileges).

2) Around the relevant line (1066) you will find this code:

    if not isinstance(score, numbers.Number):
        raise ValueError("scoring must return a number, got %s (%s)"
                         " instead." % (str(score), type(score)))

These lines need to be removed. To keep track of what was there (in case you ever want to change it back), replace them with the following:

    if not isinstance(score, numbers.Number):
        pass
        # raise ValueError("scoring must return a number, got %s (%s)"
        #                  " instead." % (str(score), type(score)))

If what your scorer returns doesn't make cross_val_score choke elsewhere, this should resolve your issue. Please let me know if this is the case.

answered Sep 17 '22 by eickenberg