
Boxplots in matplotlib: Markers and outliers

I have some questions about boxplots in matplotlib:

Question A. What do the markers that I highlighted below with Q1, Q2, and Q3 represent? I believe Q1 is the maximum and Q3 marks the outliers, but what is Q2?

[Figure: boxplot with three markers labeled Q1, Q2, and Q3]

Question B. How does matplotlib identify outliers? (i.e., how does it know that they are not the true max and min values?)

Amelio Vazquez-Reina asked Jul 18 '13 14:07




2 Answers

A picture is worth a thousand words. Note that the outliers (the + markers in your plot) are simply points outside the wide [Q1 - 1.5 IQR, Q3 + 1.5 IQR] margin shown below, where IQR = Q3 - Q1 is the interquartile range.

[Figure: boxplot drawn against a normal distribution, marking the Q1 - 1.5 IQR and Q3 + 1.5 IQR limits]

However, the picture is only an example for a normally distributed data set. It is important to understand that matplotlib does not first estimate a normal distribution and then calculate the quartiles from the estimated distribution parameters, as in the picture above.

Instead, the median and the quartiles are calculated directly from the data. Thus, your boxplot may look different depending on the distribution of your data and the size of the sample, e.g., asymmetric and with more or fewer outliers.
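To make the rule concrete, here is a minimal sketch that computes the quartiles straight from a sample and flags everything outside [Q1 - 1.5 IQR, Q3 + 1.5 IQR]. It uses only the standard library; the function names and the sample values are made up for illustration, and the choice of linear interpolation for the quartiles (method="inclusive") is an assumption, so treat this as an approximation of what matplotlib does rather than its actual code.

```python
import statistics


def quartile_fences(data, whis=1.5):
    """Return (q1, median, q3, lower_fence, upper_fence) for a sample.

    Hypothetical helper: the quartiles come directly from the data,
    with no distribution fitted first.
    """
    # method="inclusive" interpolates linearly between sorted samples
    # (an assumption about the quartile convention).
    q1, med, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1  # interquartile range
    return q1, med, q3, q1 - whis * iqr, q3 + whis * iqr


def find_outliers(data, whis=1.5):
    """Return the points that fall outside the two fences."""
    _, _, _, lower, upper = quartile_fences(data, whis)
    return [x for x in data if x < lower or x > upper]


# Illustrative sample: nine evenly spaced values plus one extreme value.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]
print(quartile_fences(sample))
print(find_outliers(sample))
```

For this sample only the value 30 lies beyond the upper fence, so it is the one point that would be drawn as a + marker instead of being treated as the maximum.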

Amelio Vazquez-Reina answered Sep 17 '22 13:09


The box spans from the first quartile to the third quartile, with the red line marking the median (2nd quartile). The documentation gives the default whisker length as 1.5 IQR:

boxplot(x, notch=False, sym='+', vert=True, whis=1.5,
        positions=None, widths=None, patch_artist=False,
        bootstrap=None, usermedians=None, conf_intervals=None)

and

whis : [ default 1.5 ]

Defines the length of the whiskers as a function of the inner quartile range. They extend to the most extreme data point within ( whis*(75%-25%) ) data range.
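The key detail in that description is that a whisker ends at an actual data point, not at the computed limit itself. A rough sketch of that rule, again with the standard library only: the function name is hypothetical, the quartile interpolation method is an assumption, and clamping the whisker to the box edge when no point lies beyond the quartile is my understanding of the edge-case handling, not something the quoted documentation states.

```python
import statistics


def whisker_ends(data, whis=1.5):
    """Return (low, high): where the two whiskers would stop.

    Hypothetical helper illustrating "the most extreme data point
    within whis * (75% - 25%)" of the box.
    """
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_limit = q1 - whis * iqr
    upper_limit = q3 + whis * iqr

    # Most extreme observations that still lie inside the limits.
    low = min(x for x in data if x >= lower_limit)
    high = max(x for x in data if x <= upper_limit)

    # Assumed edge case: never let a whisker end inside the box.
    return min(low, q1), max(high, q3)


print(whisker_ends([1, 2, 3, 4, 5, 6, 7, 8, 9, 30]))
```

With this sample the upper whisker stops at 9, the largest value inside the limit, and 30 is left over as an outlier. For real plots, matplotlib.cbook.boxplot_stats reports the whisker positions and fliers that matplotlib itself computes.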

If you're confused about different box plot representations, try reading the description on Wikipedia.

seth answered Sep 21 '22 13:09