Equivalent of 'range' in boxplot for ggplot2

I am trying to get the whiskers of ggplot2's geom_boxplot to cover the outliers. The outliers would then not be displayed as dots, because the whiskers would encompass them.

If I was using the standard 'boxplot', I would be using:

boxplot(x, range=n)

where n would be a large number so that, instead of displaying the outliers, the boxplot's whiskers extend to cover them.

How can this be done with ggplot2? I've tried:

ggplot(myDF, aes(x=x, y=y)) +
geom_boxplot(range = 5)

Note: I do not want to discard the outliers using something like:

geom_boxplot(outlier.shape = NA) 
Ant asked Sep 03 '13 14:09

2 Answers

I suppose this question is still relevant, because this page is in the top 3 of Google search results for this outlier issue. So:

An easier way to deal with outliers (at least in the latest ggplot2 as of 04 Apr 2016) is to use "coef":

... + geom_boxplot(coef = 5)

From the manual (?geom_boxplot output copy-paste below):

coef length of the whiskers as multiple of IQR. Defaults to 1.5

Details

The upper whisker extends from the hinge to the highest value that is within 1.5 * IQR of the hinge, where IQR is the inter-quartile range, or distance between the first and third quartiles. The lower whisker extends from the hinge to the lowest value within 1.5 * IQR of the hinge. Data beyond the end of the whiskers are outliers and plotted as points (as specified by Tukey).

In a notched box plot, the notches extend 1.58 * IQR / sqrt(n). This gives a roughly 95% confidence interval for comparing medians. See McGill et al. (1978) for more details.
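To see the effect of coef on real numbers, here is a minimal sketch using the built-in mtcars data (the data set and the name box_data are only illustrative choices). It draws the plot with coef = 5 and then uses layer_data() to inspect what ggplot2 computed, so you can check whether any points are still treated as outliers.

```r
library(ggplot2)

# Illustrative data: mpg by number of cylinders from the built-in mtcars set.
# With the default coef = 1.5 the 8-cylinder group shows outlier points.
p <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot(coef = 5)  # whiskers may reach up to 5 * IQR beyond the hinges

# layer_data() returns the statistics ggplot2 computed for the layer;
# the 'outliers' list column holds the points lying beyond the whiskers.
box_data <- layer_data(p)
print(lengths(box_data$outliers))

print(p)
```

If some groups still report outliers for your own data, coef probably needs to be larger; how large depends on how far the extreme values sit from the hinges relative to the IQR.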

Dmitriy answered Oct 04 '22 00:10


The only way I know of is to compute the box values yourself like this:

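library(plyr)
library(ggplot2)

# Compute the five box statistics per cyl group. ymin/ymax are the group
# extremes, so the whiskers cover every observation.
xx <- ddply(mtcars, .(cyl),
            transform,
            ymin = min(mpg),
            ymax = max(mpg),
            middle = median(mpg),
            lower = quantile(mpg, 0.25),
            upper = quantile(mpg, 0.75))

# stat = 'identity' draws the boxes from the precomputed values
ggplot(data = xx, aes(x = factor(cyl))) +
    geom_boxplot(aes(ymin = ymin, ymax = ymax, middle = middle,
                     upper = upper, lower = lower),
                 stat = 'identity')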

There are some warnings on the ddply call, but you should be able to ignore them safely.
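A variant of the same idea, as a rough sketch: compute exactly one row of box statistics per group in base R and plot that with stat = 'identity'. This does not depend on plyr. The name box_stats is made up for this example, and mtcars is used only because it ships with R.

```r
library(ggplot2)

# One row per cyl group: min, quartiles, median and max of mpg.
box_stats <- do.call(rbind, lapply(split(mtcars$mpg, mtcars$cyl), function(v) {
  q <- quantile(v, c(0.25, 0.5, 0.75), names = FALSE)
  data.frame(ymin = min(v), lower = q[1], middle = q[2],
             upper = q[3], ymax = max(v))
}))

# The group labels ("4", "6", "8") end up as row names; turn them into a column.
box_stats$cyl <- factor(rownames(box_stats))

# Draw the boxes directly from the precomputed values.
p <- ggplot(box_stats, aes(x = cyl)) +
  geom_boxplot(aes(ymin = ymin, lower = lower, middle = middle,
                   upper = upper, ymax = ymax),
               stat = "identity")
print(p)
```

Because ymin and ymax are the group minimum and maximum, the whiskers always span the full data and no separate outlier points are drawn.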

joran answered Oct 04 '22 00:10