 

F1 smaller than both precision and recall in Scikit-learn

I am doing multi-class classification with unbalanced categories.

I noticed that the F1 score is always smaller than the direct harmonic mean of precision and recall, and in some cases it is even smaller than both precision and recall.

FYI, I called metrics.precision_score(y,pred) for precision and so on.

I am aware of the difference between micro- and macro-averaging, and I verified that the scores are not micro-averaged by checking the per-category results from precision_recall_fscore_support().

I am not sure whether this is because macro-averaging is used, or whether there is some other reason.
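For reference, the calls were roughly as follows (a minimal sketch with made-up labels; the average='weighted' argument is an assumption matching the library's old default for multi-class targets, while newer versions require it to be passed explicitly):

from sklearn import metrics

# Made-up labels for illustration only; the real data is not shown here.
y    = [0, 0, 1, 1, 2, 2, 2]
pred = [0, 1, 1, 1, 2, 2, 0]

# Averaged scores, as in the question.
f1 = metrics.f1_score(y, pred, average='weighted')
precision = metrics.precision_score(y, pred, average='weighted')
recall = metrics.recall_score(y, pred, average='weighted')

# Per-category scores used to rule out micro-averaging (with micro-averaging,
# precision, recall and F1 are all equal in the single-label multi-class case).
p, r, f, support = metrics.precision_recall_fscore_support(y, pred, average=None)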


Update: detailed results below.

n_samples: 75, n_features: 250

MultinomialNB(alpha=0.01, fit_prior=True)

2-fold CV:

1st run:

F1:        0.706029106029
Precision: 0.731531531532
Recall:    0.702702702703

         precision    recall  f1-score   support

      0       0.44      0.67      0.53         6
      1       0.80      0.50      0.62         8
      2       0.78      0.78      0.78        23

avg / total       0.73      0.70      0.71        37

2nd run:

F1:        0.787944219523
Precision: 0.841165413534
Recall:    0.815789473684

         precision    recall  f1-score   support

      0       1.00      0.29      0.44         7
      1       0.75      0.86      0.80         7
      2       0.82      0.96      0.88        24

avg / total       0.84      0.82      0.79        38

Overall:

Overall f1-score:   0.74699 (+/- 0.02)
Overall precision:  0.78635 (+/- 0.03)
Overall recall:     0.75925 (+/- 0.03)
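As a quick sanity check (a sketch, assuming the avg / total row is a support-weighted mean of the per-class scores, i.e. the "weighted" averaging discussed below), the second run's numbers can be approximately reproduced from the rounded per-class values:

import numpy as np

# Rounded per-class scores from the 2nd run's report above.
precision = np.array([1.00, 0.75, 0.82])
recall    = np.array([0.29, 0.86, 0.96])
f1        = np.array([0.44, 0.80, 0.88])
support   = np.array([7, 7, 24])

weights = support / support.sum()
print(np.dot(weights, precision))  # ~0.84
print(np.dot(weights, recall))     # ~0.82
print(np.dot(weights, f1))         # ~0.78: the weighted mean of the per-class
                                   # F1 scores, not the harmonic mean of 0.84
                                   # and 0.82, which is why the averaged F1 can
                                   # end up below both precision and recall.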

Definitions of micro-/macro-averaging from Scholarpedia:

In multi-label classification, the simplest method for computing an aggregate score across categories is to average the scores of all binary tasks. The resulting scores are called macro-averaged recall, precision, F1, etc. Another way of averaging is to first sum TP, FP, TN, FN and N over all the categories, and then compute each of the above metrics. The resulting scores are called micro-averaged. Macro-averaging gives an equal weight to each category, and is often dominated by the system's performance on rare categories (the majority) in a power-law-like distribution. Micro-averaging gives an equal weight to each document, and is often dominated by the system's performance on the most common categories.
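To make the two definitions concrete, here is a small sketch using per-class counts that happen to match the confusion-matrix example further below:

# Per-class true positive / false positive / false negative counts.
counts = [
    {"tp": 9, "fp": 4, "fn": 3},   # class 0
    {"tp": 5, "fp": 4, "fn": 4},   # class 1
    {"tp": 4, "fp": 1, "fn": 2},   # class 2
]

# Macro-averaging: compute the metric per class, then take the plain mean.
per_class_precision = [c["tp"] / (c["tp"] + c["fp"]) for c in counts]
per_class_recall = [c["tp"] / (c["tp"] + c["fn"]) for c in counts]
macro_precision = sum(per_class_precision) / len(counts)
macro_recall = sum(per_class_recall) / len(counts)

# Micro-averaging: sum the counts over all classes first, then compute once.
tp = sum(c["tp"] for c in counts)
fp = sum(c["fp"] for c in counts)
fn = sum(c["fn"] for c in counts)
micro_precision = tp / (tp + fp)
micro_recall = tp / (tp + fn)

print(macro_precision, macro_recall)  # ~0.683, ~0.657
print(micro_precision, micro_recall)  # ~0.667, ~0.667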


This is currently an open issue on GitHub (#83).


The following example demonstrates how micro, macro, and weighted (currently used in scikit-learn) averaging may differ:

y    = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2]
pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 2, 0, 1, 2, 2, 2, 2]

Confusion matrix:

[[9 3 0]
 [3 5 1]
 [1 1 4]]

Wei Pre: 0.670655270655
Wei Rec: 0.666666666667
Wei F1 : 0.666801346801
Wei F5 : 0.668625356125

Mic Pre: 0.666666666667
Mic Rec: 0.666666666667
Mic F1 : 0.666666666667
Mic F5 : 0.666666666667

Mac Pre: 0.682621082621
Mac Rec: 0.657407407407
Mac F1 : 0.669777037588
Mac F5 : 0.677424801371

F5 above is shorthand for F0.5.
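For anyone who wants to reproduce these numbers with a recent scikit-learn, a sketch is below. The weighted and micro values should match; the macro F-scores above appear to be computed from the macro-averaged precision and recall, whereas current versions average the per-class F-scores, so those two values can differ slightly.

from sklearn.metrics import confusion_matrix, fbeta_score, precision_recall_fscore_support

y    = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2]
pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 2, 0, 1, 2, 2, 2, 2]

print(confusion_matrix(y, pred))

for avg in ("weighted", "micro", "macro"):
    # Precision, recall and F1 in one call; F0.5 via fbeta_score with beta=0.5.
    p, r, f1, _ = precision_recall_fscore_support(y, pred, average=avg)
    f05 = fbeta_score(y, pred, beta=0.5, average=avg)
    print(avg, p, r, f1, f05)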

asked Nov 04 '22 by Flake

1 Answer

Can you please update your question with the output of:

>>> from sklearn.metrics import classification_report
>>> print(classification_report(y_true, y_predicted))

That will display the precision, recall, and support for each individual category, which will help us make sense of how the averaging works and decide whether this behavior is appropriate.

answered Nov 15 '22 by ogrisel