I am doing multi-class classification with unbalanced categories.
I noticed that f1 is always smaller than the direct harmonic mean of precision and recall, and in some cases f1 is even smaller than both precision and recall.
FYI, I called metrics.precision_score(y, pred) for precision, and the corresponding functions for the other metrics.
I am aware of the difference between micro- and macro-averaging, and I verified that the scores are not micro-averaged by checking the per-category results from precision_recall_fscore_support().
I'm not sure whether this is because macro-averaging is used, or for some other reason.
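Roughly, the calls look like this (a minimal sketch; the toy y / pred here are made-up placeholders, and I pass average='weighted' explicitly, which corresponds to the weighted averaging discussed in the answers):

from sklearn import metrics

# Made-up toy labels/predictions just to illustrate the calls;
# my real data has 3 unbalanced classes.
y    = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2, 2, 2]
pred = [0, 0, 1, 2, 1, 0, 2, 2, 2, 1, 0, 2]

p  = metrics.precision_score(y, pred, average='weighted')
r  = metrics.recall_score(y, pred, average='weighted')
f1 = metrics.f1_score(y, pred, average='weighted')

# Direct harmonic mean of the averaged precision and recall, for comparison
hm = 2 * p * r / (p + r)
print(p, r, f1, hm)   # f1 comes out below hm, and sometimes below both p and r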
Update: detailed results are below.
n_samples: 75, n_features: 250
MultinomialNB(alpha=0.01, fit_prior=True)
2-fold CV:
1st run:
F1: 0.706029106029
Precision: 0.731531531532
Recall: 0.702702702703
             precision    recall  f1-score   support

          0       0.44      0.67      0.53         6
          1       0.80      0.50      0.62         8
          2       0.78      0.78      0.78        23

avg / total       0.73      0.70      0.71        37
2nd run:
F1: 0.787944219523
Precision: 0.841165413534
Recall: 0.815789473684
             precision    recall  f1-score   support

          0       1.00      0.29      0.44         7
          1       0.75      0.86      0.80         7
          2       0.82      0.96      0.88        24

avg / total       0.84      0.82      0.79        38
Overall:
Overall f1-score: 0.74699 (+/- 0.02)
Overall precision: 0.78635 (+/- 0.03)
Overall recall: 0.75925 (+/- 0.03)
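For reference, the numbers above come from a 2-fold cross-validation along the lines of the following sketch (the synthetic X and labels are stand-ins for my real 75 x 250 count matrix; the actual feature extraction is omitted):

import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

# Stand-in data: 75 samples, 250 count features, 3 unbalanced classes
rng = np.random.RandomState(0)
X = rng.poisson(1.0, size=(75, 250))
labels = rng.choice([0, 1, 2], size=75, p=[0.2, 0.2, 0.6])

f1s, precisions, recalls = [], [], []
for train_idx, test_idx in StratifiedKFold(n_splits=2).split(X, labels):
    clf = MultinomialNB(alpha=0.01, fit_prior=True)
    clf.fit(X[train_idx], labels[train_idx])
    pred = clf.predict(X[test_idx])
    f1s.append(metrics.f1_score(labels[test_idx], pred, average='weighted'))
    precisions.append(metrics.precision_score(labels[test_idx], pred, average='weighted'))
    recalls.append(metrics.recall_score(labels[test_idx], pred, average='weighted'))
    print(metrics.classification_report(labels[test_idx], pred))

print("Overall f1-score:  %.5f (+/- %.2f)" % (np.mean(f1s), np.std(f1s)))
print("Overall precision: %.5f (+/- %.2f)" % (np.mean(precisions), np.std(precisions)))
print("Overall recall:    %.5f (+/- %.2f)" % (np.mean(recalls), np.std(recalls)))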
Definitions of micro/macro-averaging from Scholarpedia:
In multi-label classification, the simplest method for computing an aggregate score across categories is to average the scores of all binary tasks. The resulting scores are called macro-averaged recall, precision, F1, etc. Another way of averaging is to sum TP, FP, TN, FN and N over all the categories first, and then compute each of the above metrics. The resulting scores are called micro-averaged. Macro-averaging gives an equal weight to each category, and is often dominated by the system's performance on rare categories (the majority) in a power-law-like distribution. Micro-averaging gives an equal weight to each document, and is often dominated by the system's performance on the most common categories.
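To make the two definitions concrete, here is a small sketch that computes both averages directly from per-class counts (the toy y_true / y_pred are made up purely for illustration):

import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels, only to make the definitions concrete
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 0, 0, 0, 1, 2, 1, 1, 0, 2, 2, 1]

cm = confusion_matrix(y_true, y_pred)      # rows = true class, columns = predicted class
tp = np.diag(cm).astype(float)             # true positives per class
fp = cm.sum(axis=0) - tp                   # predicted as the class but wrong
fn = cm.sum(axis=1) - tp                   # belonging to the class but missed

# Macro: compute the metric per class, then take an unweighted mean
macro_precision = np.mean(tp / (tp + fp))
macro_recall    = np.mean(tp / (tp + fn))

# Micro: pool TP/FP/FN over all classes first, then compute the metric once
micro_precision = tp.sum() / (tp.sum() + fp.sum())
micro_recall    = tp.sum() / (tp.sum() + fn.sum())

print(macro_precision, macro_recall, micro_precision, micro_recall)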
This is currently an open issue on GitHub, #83. The underlying point is that scikit-learn's weighted F1 is the support-weighted mean of the per-class F1 scores, not the harmonic mean of the weighted precision and recall, so it can legitimately fall below both.
The following example demonstrates how micro, macro, and weighted (what scikit-learn currently uses) averaging may differ:
y = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2]
pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 2, 0, 1, 2, 2, 2, 2]
Confusion matrix:
[[9 3 0]
[3 5 1]
[1 1 4]]
Wei Pre: 0.670655270655
Wei Rec: 0.666666666667
Wei F1 : 0.666801346801
Wei F5 : 0.668625356125
Mic Pre: 0.666666666667
Mic Rec: 0.666666666667
Mic F1 : 0.666666666667
Mic F5 : 0.666666666667
Mac Pre: 0.682621082621
Mac Rec: 0.657407407407
Mac F1 : 0.669777037588
Mac F5 : 0.677424801371
F5 above is shorthand for F0.5.
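The figures above should be reproducible along these lines; precision_recall_fscore_support with beta=0.5 yields the F0.5 rows (a sketch, not necessarily the exact script used):

from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

y = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2]
pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 2, 0, 1, 2, 2, 2, 2]

print(confusion_matrix(y, pred))

for avg in ('weighted', 'micro', 'macro'):
    p, r, f1, _ = precision_recall_fscore_support(y, pred, average=avg)
    _, _, f05, _ = precision_recall_fscore_support(y, pred, beta=0.5, average=avg)
    print(avg, p, r, f1, f05)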
Can you please update your question with the output of:
>>> from sklearn.metrics import classification_report
>>> print(classification_report(y_true, y_predicted))
That will display the precision, recall and support for each individual category, which will help us make sense of how the averaging works and decide whether this is appropriate behavior or not.