In sklearn.metrics.f1_score, the f1 score has a parameter called "average". What do macro, micro, weighted, and samples mean? Please elaborate, because in the documentation it was not explained properly. Or simply answer the following:
1. Why is "samples" the best parameter for multilabel classification?
2. Why is micro best for an imbalanced dataset?
If you have an imbalanced dataset, you should use the macro F1 score, as it still reflects true model performance even when the classes are skewed. If you have a balanced dataset, the micro F1 score can be a reasonable choice, especially when communicating the results to end users is important.
The weighted-averaged F1 score is calculated by taking the mean of all per-class F1 scores, weighting each class by its support. Support refers to the number of actual occurrences of the class in the dataset.
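To make the support weighting concrete, here is a minimal sketch (toy labels I invented, not taken from the question) that reproduces sklearn's weighted average by hand:

```python
import numpy as np
from sklearn.metrics import f1_score

y_true = [0, 0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 0, 1, 1, 2, 2]

per_class_f1 = f1_score(y_true, y_pred, average=None)  # array with one F1 per class
support = np.bincount(y_true)                          # how often each class actually occurs

print(np.average(per_class_f1, weights=support))       # support-weighted mean, ~0.728
print(f1_score(y_true, y_pred, average='weighted'))    # sklearn's 'weighted' gives the same value
```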
The micro F1 score (short for micro-averaged F1 score) is commonly used for multiclass and multi-label problems. It measures the F1 score of the aggregated contributions of all classes, i.e. it pools the true positives, false positives and false negatives across classes before computing a single score. If you want to select a model based on a balance between precision and recall, the F1 scores are the metric to look at.
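A similarly hedged sketch of that "aggregated contributions" idea, pooling the per-class counts by hand before computing one F1 (same invented labels as above; this is an illustration, not sklearn's internal code):

```python
from sklearn.metrics import f1_score, multilabel_confusion_matrix

y_true = [0, 0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 0, 1, 1, 2, 2]

# One 2x2 confusion matrix per class, laid out as [[TN, FP], [FN, TP]]
mcm = multilabel_confusion_matrix(y_true, y_pred)
tp = mcm[:, 1, 1].sum()
fp = mcm[:, 0, 1].sum()
fn = mcm[:, 1, 0].sum()

micro_f1 = 2 * tp / (2 * tp + fp + fn)
print(micro_f1)                                   # pooled counts, ~0.714
print(f1_score(y_true, y_pred, average='micro'))  # same value
```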
The question is about the meaning of the average parameter in sklearn.metrics.f1_score.
As you can see from the code:
- average=micro: tells the function to compute F1 from the total true positives, false negatives and false positives, regardless of which label each prediction belongs to.
- average=macro: tells the function to compute F1 for each label and return the unweighted average, without considering the proportion of each label in the dataset.
- average=weighted: tells the function to compute F1 for each label and return the average weighted by the proportion (support) of each label in the dataset.
- average=samples: tells the function to compute F1 for each instance and return the average. Use it for multilabel classification (a short sketch follows the article link below).

I found a really helpful article explaining the differences more thoroughly and with examples: https://towardsdatascience.com/multi-class-metrics-made-simple-part-ii-the-f1-score-ebe8b2c2ca1
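Since average='samples' gets the least attention, here is a small illustrative sketch with multilabel data I made up: 'samples' computes one F1 per instance (one row of the label-indicator matrix) and then averages over instances, which is why it only applies to multilabel targets.

```python
import numpy as np
from sklearn.metrics import f1_score

# Label-indicator matrix: each row is one sample, each column one label
Y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0]])
Y_pred = np.array([[1, 0, 0],
                   [0, 1, 1],
                   [1, 1, 0]])

print(f1_score(Y_true, Y_pred, average='samples'))   # one F1 per row, then averaged (~0.778)
print(f1_score(Y_true, Y_pred, average='micro'))     # pooled counts over all cells (0.8)
print(f1_score(Y_true, Y_pred, average='macro'))     # unweighted mean over labels (~0.667)
print(f1_score(Y_true, Y_pred, average='weighted'))  # support-weighted mean over labels (0.8)
```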
Unfortunately, it doesn't tackle the 'samples' parameter, and I have not experimented with multi-label classification yet, so I'm not able to answer question number 1. As for the others:
Where does this information come from? If I understood the differences correctly, micro is not the best indicator for an imbalanced dataset, but one of the worst, since it does not take the class proportions into account. As described in the article, micro F1 equals accuracy, which is a flawed indicator for imbalanced data. For example: the classifier is supposed to identify cat pictures among thousands of random pictures, and only 1% of the dataset consists of cat pictures (imbalanced dataset). Even if it does not identify a single cat picture, it has an accuracy / micro F1 score of 99%, since 99% of the data was correctly identified as not cat pictures.
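For what it's worth, here is a runnable version of that thought experiment (the exact numbers are mine): 1000 pictures, 1% of them cats, and a classifier that never predicts "cat".

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

y_true = np.array([1] * 10 + [0] * 990)   # 1 = cat, only 1% of the data
y_pred = np.zeros(1000, dtype=int)        # the model never predicts a single cat

print(accuracy_score(y_true, y_pred))                # 0.99
print(f1_score(y_true, y_pred, average='micro'))     # 0.99 -- identical to accuracy
print(f1_score(y_true, y_pred, average='macro'))     # ~0.497 -- exposes the useless model
# (sklearn warns that precision is undefined for the never-predicted cat class and scores it 0)
```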
Trying to put it in a nutshell: macro is simply the arithmetic mean of the individual per-class scores, while weighted additionally takes each class's sample size (support) into account. I recommend the article for details; I can provide more examples if needed.
I know that the question is quite old, but I hope this helps someone. Please correct me if I'm wrong. I've done some research, but am not an expert.