
classification_report vs f1_score in scikit-learn's classification metrics

What is the right way to evaluate a binary classifier using scikit-learn's evaluation metrics?

Given y_test and y_pred as the gold and predicted labels, shouldn't the F1 score in the classification_report output be the same as what f1_score produces?

Here is how I do it:

print(classification_report(y_test, y_pred))

gives the following table:

         precision    recall  f1-score   support

      0       0.49      0.18      0.26       204
      1       0.83      0.96      0.89       877

avg / total       0.77      0.81      0.77      1081

However,

print(f1_score(y_test, y_pred))

gives F1 score = 0.89

Now, given the above outputs, is the performance of this model F1 score = 0.89 or is it 0.77?

asked by user2161903

1 Answer

In short, for your case, the F1 score of the positive class (label 1) is 0.89, and the support-weighted average F1 score over both classes is 0.77.

Take a look at the docstring of sklearn.metrics.f1_score:

The F1 score can be interpreted as a weighted average of the precision and
recall, where an F1 score reaches its best value at 1 and worst score at 0.
The relative contribution of precision and recall to the F1 score are
equal. The formula for the F1 score is::

    F1 = 2 * (precision * recall) / (precision + recall)

In the multi-class and multi-label case, this is the weighted average of
the F1 score of each class.

The key is the final sentence. If you want the weighted average of the per-class F1 scores, then you shouldn't feed the function plain 0/1 labels. For example, you could do

f1_score(y_test + 1, y_pred + 1)
# 0.77
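
For reference, that 0.77 is just the support-weighted mean of the two per-class f1-scores shown in the report above:

# support-weighted mean of the per-class f1-scores (0.26 and 0.89, supports 204 and 877)
(204 * 0.26 + 877 * 0.89) / (204 + 877)
# ≈ 0.77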

If the class labels are not 0/1, then it is treated as a multiclass metric (where you care about all precision/recall scores) rather than a binary metric (where you care about precision/recall only for positive samples). I agree that this might be a bit surprising, but in general 0/1 classes are treated as a marker of binary classification.
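
In recent scikit-learn versions you don't need the relabeling trick: the average parameter of f1_score selects the behavior explicitly. A minimal, self-contained sketch (using made-up toy labels, not your data):

from sklearn.metrics import f1_score

y_test = [0, 0, 1, 1, 1]   # gold labels (toy example)
y_pred = [0, 1, 1, 1, 1]   # predicted labels (toy example)

# F1 of the positive class only -- what a bare f1_score call reports for 0/1 labels
print(f1_score(y_test, y_pred, average='binary'))

# support-weighted average of the per-class F1 scores -- the "avg / total" style figure
print(f1_score(y_test, y_pred, average='weighted'))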


Edit: some of the behavior listed here is deprecated since Scikit-learn 0.16 – in particular the confusing implicit assumptions about binary vs non-binary classifications. See this github thread for details.

answered by jakevdp