I would like to compute the recall, precision and F-measure of a cross-validation test for different classifiers. scikit-learn comes with cross_val_score, but unfortunately that method does not return multiple values.
I could compute these measures by calling cross_val_score three times (once per metric), but that is not efficient. Is there a better solution?
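To make the inefficiency concrete, this is roughly what the three separate calls would look like (a sketch with placeholder clf, X and y; depending on the scikit-learn version, cross_val_score lives in sklearn.cross_validation or sklearn.model_selection):

```python
# Naive approach: three full cross-validation runs, each refitting
# the classifier on every fold just to obtain one more metric.
from sklearn.cross_validation import cross_val_score  # sklearn.model_selection in >= 0.18

precision = cross_val_score(clf, X, y, cv=5, scoring='precision')
recall = cross_val_score(clf, X, y, cv=5, scoring='recall')
f1 = cross_val_score(clf, X, y, cv=5, scoring='f1')
```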
So far I have written this function:
```python
import numpy as np
from sklearn import metrics

def mean_scores(X, y, clf, skf):
    # Accumulate the flattened confusion matrix over all folds.
    cm = np.zeros(len(np.unique(y)) ** 2)
    for train, test in skf:
        clf.fit(X[train], y[train])
        y_pred = clf.predict(X[test])
        cm += metrics.confusion_matrix(y[test], y_pred).flatten()
    # For binary labels, sklearn's confusion matrix flattens to (tn, fp, fn, tp).
    return compute_measures(*cm / skf.n_folds)

def compute_measures(tn, fp, fn, tp):
    """Computes effectiveness measures given a confusion matrix."""
    specificity = tn / (tn + fp)
    sensitivity = tp / (tp + fn)
    fmeasure = 2 * (specificity * sensitivity) / (specificity + sensitivity)
    return sensitivity, specificity, fmeasure
```
It basically sums up the confusion matrix values over the folds, and once you have the true/false positives and negatives you can easily compute recall, precision and so on... But I still don't like this solution :)
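For reference, a call of this function would look roughly like the following (a sketch with toy data, assuming the pre-0.18 StratifiedKFold API where the object itself is iterated):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.cross_validation import StratifiedKFold  # sklearn < 0.18 layout

X = np.random.rand(100, 5)           # toy data, just for illustration
y = np.random.randint(0, 2, 100)

skf = StratifiedKFold(y, n_folds=5)  # yields (train, test) index pairs
sensitivity, specificity, fmeasure = mean_scores(X, y, SVC(kernel='linear'), skf)
```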
Now in scikit-learn: cross_validate is a new function that can evaluate a model on multiple metrics. This feature is also available in GridSearchCV and RandomizedSearchCV (doc). It has been merged recently in master and will be available in v0.19.
From the scikit-learn doc:

The cross_validate function differs from cross_val_score in two ways:

1. It allows specifying multiple metrics for evaluation.
2. It returns a dict containing training scores, fit-times and score-times in addition to the test score.
The typical use case looks like this:
```python
from sklearn.svm import SVC
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_validate

iris = load_iris()
scoring = ['precision', 'recall', 'f1']
clf = SVC(kernel='linear', C=1, random_state=0)
scores = cross_validate(clf, iris.data, iris.target == 1, cv=5,
                        scoring=scoring, return_train_score=False)
```
See also this example.
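The returned scores object is a plain dict of per-fold arrays, so aggregating it is a one-liner per metric; with return_train_score=False the keys are 'fit_time', 'score_time' and one 'test_<metric>' entry per requested metric:

```python
import numpy as np

# Print the mean and standard deviation of each metric over the folds.
for metric in scoring:
    per_fold = scores['test_%s' % metric]
    print('%s: %.3f +/- %.3f' % (metric, per_fold.mean(), per_fold.std()))
```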
The solution you present implements exactly the functionality of cross_val_score, perfectly adapted to your situation. It seems like the right way to go.

cross_val_score takes the argument n_jobs=, making the evaluation parallelizable. If this is something you need, you should look into replacing your for loop with a parallel loop, using sklearn.externals.joblib.Parallel.
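A rough sketch of what that parallel loop could look like, reusing compute_measures from your question (the helper names here are mine, not scikit-learn's):

```python
import numpy as np
from sklearn import metrics
from sklearn.base import clone
from sklearn.externals.joblib import Parallel, delayed

def _fold_confusion(clf, X, y, train, test):
    # Fit a fresh copy of the classifier on one fold and return its
    # flattened confusion matrix.
    clf = clone(clf)
    clf.fit(X[train], y[train])
    y_pred = clf.predict(X[test])
    return metrics.confusion_matrix(y[test], y_pred).flatten()

def mean_scores_parallel(X, y, clf, skf, n_jobs=-1):
    # Run the folds in parallel, then sum the confusion matrices.
    cms = Parallel(n_jobs=n_jobs)(
        delayed(_fold_confusion)(clf, X, y, train, test)
        for train, test in skf)
    cm = np.sum(cms, axis=0)
    return compute_measures(*cm / skf.n_folds)
```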
On a more general note, a discussion about the problem of multiple scores is going on in the scikit-learn issue tracker. A representative thread can be found here. So while it looks like future versions of scikit-learn will permit multiple scorer outputs, as of now this is impossible.
A hacky (disclaimer!) way to get around this is to change the code in cross_validation.py ever so slightly, by removing a condition check on whether your score is a number. However, this suggestion is very version-dependent, so I will present it for version 0.14.
1) In IPython, type from sklearn import cross_validation, followed by cross_validation??. Note the filename that is displayed and open it in an editor (you may need root privileges).
2) You will find this code; the relevant check is at line 1066. It says:
```python
if not isinstance(score, numbers.Number):
    raise ValueError("scoring must return a number, got %s (%s)"
                     " instead." % (str(score), type(score)))
```
These lines need to be removed. To keep track of what was there (in case you ever want to change it back), replace them with the following:
```python
if not isinstance(score, numbers.Number):
    pass
    # raise ValueError("scoring must return a number, got %s (%s)"
    #                  " instead." % (str(score), type(score)))
```
If what your scorer returns doesn't make cross_val_score choke elsewhere, this should resolve your issue. Please let me know if this is the case.
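To illustrate, a scorer returning several numbers at once might look like this (a sketch that only works after the patch above, and only as long as nothing downstream insists on a scalar):

```python
from sklearn import metrics

def multi_scorer(estimator, X, y):
    # Standard scorer-callable signature: (estimator, X, y) -> score(s).
    y_pred = estimator.predict(X)
    return (metrics.precision_score(y, y_pred),
            metrics.recall_score(y, y_pred),
            metrics.f1_score(y, y_pred))

# e.g. cross_validation.cross_val_score(clf, X, y, cv=5, scoring=multi_scorer)
```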