I understand the F1-measure is the harmonic mean of precision and recall. But what values define how good or bad an F1-measure is? I can't seem to find any references (Google or academic) answering my question.
For context: this is a binary classification task. Clearly, the higher the F1 score the better, with 0 being the worst possible and 1 being the best. Beyond this, most online sources don't give you any idea of how to interpret a specific F1 score. Was my F1 score of 0.56 good or bad?
Notice that the F1-score takes both precision and recall into account, which means it also accounts for both FPs and FNs. The higher the precision and recall, the higher the F1-score. The F1-score ranges between 0 and 1; the closer it is to 1, the better the model.
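For reference, the standard definition being used here:

F1 = 2 · (precision · recall) / (precision + recall)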
Here precision is fixed at 0.8, while recall varies from 0.01 to 1.0 as before:

[Figure: F1-score when precision is fixed at 0.8 and recall varies from 0.01 to 1.0.]
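A minimal sketch of the calculation behind that figure (plain Python; the recall values are illustrative, not the exact ones plotted):

```python
# F1 with precision fixed at 0.8 while recall sweeps from low to high.
precision = 0.8
for recall in [0.01, 0.1, 0.25, 0.5, 0.75, 1.0]:
    f1 = 2 * precision * recall / (precision + recall)
    print(f"recall={recall:.2f} -> F1={f1:.3f}")
```

Note that even with perfect recall, F1 tops out at about 0.89 here, because the harmonic mean is dominated by the lower of the two values.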
Consider sklearn.dummy.DummyClassifier(strategy='uniform'), a classifier that makes random guesses (a.k.a. a bad classifier). We can view DummyClassifier as a benchmark to beat; now let's see its f1-score.
In a binary classification problem with a balanced dataset (6198 total samples: 3099 labelled 0 and 3099 labelled 1), the f1-score is 0.5 for both classes, and the weighted average is 0.5:
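A minimal sketch to reproduce this, assuming a synthetic balanced dataset like the one described (the report's exact numbers will vary slightly with the random guesses):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import classification_report

# Balanced synthetic dataset: 3099 samples per class, 6198 total.
y = np.array([0] * 3099 + [1] * 3099)
X = np.zeros((len(y), 1))  # features are ignored by DummyClassifier

clf = DummyClassifier(strategy='uniform', random_state=0).fit(X, y)
print(classification_report(y, clf.predict(X)))
# f1-score comes out near 0.50 for both classes, weighted avg ~0.50
```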
Second example, using DummyClassifier(strategy='constant'), i.e. guessing the same label every time (label 1 in this case): the macro average of the f1-scores is 0.33, while the f1 for label 0 is 0.00:
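A sketch of this second case (note that strategy='constant' requires the constant argument, and zero_division=0, available in scikit-learn 0.22+, silences the warning for label 0, which is never predicted):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import classification_report

y = np.array([0] * 3099 + [1] * 3099)
X = np.zeros((len(y), 1))

clf = DummyClassifier(strategy='constant', constant=1).fit(X, y)  # always predict 1
print(classification_report(y, clf.predict(X), zero_division=0))
# label 1: precision 0.50, recall 1.00 -> f1 ~0.67
# label 0: f1 0.00; macro average ~0.33
```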
I consider these to be bad f1-scores, given the balanced dataset.
P.S. The summaries were generated using sklearn.metrics.classification_report.
You did not find any reference for what counts as a good f1-measure because there is no such range: the F1 measure is a combined metric of precision and recall.
Let's say you have two algorithms: one has higher precision and lower recall than the other. From this observation alone, you cannot tell which algorithm is better, unless your goal is simply to maximize precision.
So, given this ambiguity about how to select the superior algorithm of the two (one with higher recall, the other with higher precision), we use the f1-measure to choose between them.
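A quick illustration with made-up precision/recall pairs (purely hypothetical numbers):

```python
# Compare two hypothetical algorithms on F1 alone.
def f1(precision, recall):
    return 2 * precision * recall / (precision + recall)

print(f1(precision=0.9, recall=0.5))  # algorithm A: ~0.643
print(f1(precision=0.6, recall=0.8))  # algorithm B: ~0.686 -> B wins on F1
```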
The f1-measure is a relative term; that's why there is no absolute range that defines how good your algorithm is.