Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to mimic geom_boxplot() with outliers using geom_boxplot(stat = "identity")

I would like to pre-compute by-variable summaries of data (with plyr and passing a quantile function) and then plot with geom_boxplot(stat = "identity"). This works great except it (a) does not plot outliers as points and (b) extends the "whiskers" to the max and min of the data being plotted.

Example:

library(plyr)
library(ggplot2)

set.seed(4)
df <- data.frame(fact = sample(letters[1:2], 12, replace = TRUE),
                 val  = c(1:10, 100, 101))
df
#    fact val
# 1     b   1
# 2     a   2
# 3     a   3
# 4     a   4
# 5     b   5
# 6     a   6
# 7     b   7
# 8     b   8
# 9     b   9
# 10    a  10
# 11    b 100
# 12    a 101

by.fact.df <- ddply(df, c("fact"), function(x) quantile(x$val))

by.fact.df
#   fact 0%  25% 50%  75% 100%
# 1    a  2 3.25 5.0 9.00  101
# 2    b  1 5.50 7.5 8.75  100

# What I can do...with faults (a) and (b) above
ggplot(by.fact.df, 
       aes(x = fact, ymin = `0%`, lower = `25%`, middle = `50%`, 
           upper = `75%`,  ymax = `100%`)) +
  geom_boxplot(stat = "identity")

# What I want...
ggplot(df, aes(x = fact, y = val)) +
  geom_boxplot()

What I can do...with faults (a) and (b) mentioned above:

Plot 01

What I would like to obtain, but still leverage pre-computation via plyr (or other method):

Plot 02

Initial Thoughts: Perhaps there is some way to pre-compute the true end-points of the whiskers without the outliers? Then, subset the data for outliers and pass them as geom_point()?

Motivation: When working with larger datasets, I have found it faster and more practical to leverage plyr, dplyr, and/or data.table to pre-compute the stats and then plot them rather than having ggplot2 to the calculations.

UPDATE

I am able to extract what I need with the following mix of dplyr and plyr code, but I'm not sure if this is the most efficient way:

df %>%
  group_by(fact) %>%
  do(ldply(boxplot.stats(.$val), data.frame))

Source: local data frame [6 x 3]
Groups: fact

  fact   .id X..i..
1    a stats      2
2    a stats      4
3    a stats     10
4    a stats     13
5    a stats     16
6    a     n      9
like image 581
JasonAizkalns Avatar asked May 06 '15 13:05

JasonAizkalns


2 Answers

Here's my answer, using built-in functions quantile and boxplot.stats.

geom_boxplot does the calcualtions for boxplot slightly differently than boxplot.stats. Read ?geom_boxplot and ?boxplot.stats to understand my implementation below

#Function to calculate boxplot stats to match ggplot's implemention as in geom_boxplot.
my_boxplot.stats <-function(x){
        quantiles <-quantile(x, c(0, 0.25, 0.5, 0.75, 1))
        labels <-names(quantile(x))
        #replacing the upper whisker to geom_boxplot
        quantiles[5] <-boxplot.stats(x)$stats[5]
        res <-data.frame(rbind(quantiles))
        names(res) <-labels
        res$out <-boxplot.stats(x)$out
        return(res)
    }

Code to calculate the stats and plot it

library(dplyr)
df %>% group_by(fact) %>% do(my_boxplot.stats(.$val)) %>% 
      ggplot(aes(x=fact, y=out, ymin = `0%`, lower = `25%`, middle = `50%`,
                 upper = `75%`,  ymax = `100%`)) +
      geom_boxplot(stat = "identity") + geom_point()
like image 146
infominer Avatar answered Sep 30 '22 01:09

infominer


To get the correct statistics, you have to do some more calculations than just finding the quantiles. The geom_boxplot function with stat = "identity" does not draw the outliers. So you have to calculate the statistics without the outliers and then use geom_point to draw the outliers seperately. The following function (basically a simplified version of stat_boxplot) is probably not the most efficient, but it gives the desired result:

box.df <- df %>% group_by(fact) %>% do({
  stats <- as.numeric(quantile(.$val, c(0, 0.25, 0.5, 0.75, 1)))
  iqr <- diff(stats[c(2, 4)])
  coef <- 1.5
  outliers <- .$val < (stats[2] - coef * iqr) | .$val > (stats[4] + coef * iqr)
  if (any(outliers)) {
    stats[c(1, 5)] <- range(c(stats[2:4], .$val[!outliers]), na.rm=TRUE)
  }
  outlier_values = .$val[outliers]
  if (length(outlier_values) == 0) outlier_values <- NA_real_
  res <- as.list(t(stats))
  names(res) <- c("lower.whisker", "lower.hinge", "median", "upper.hinge", "upper.whisker")
  res$out <- outlier_values
  as.data.frame(res)
})
box.df
## Source: local data frame [2 x 7]
## Groups: fact
## 
##   fact lower.whisker lower.hinge median upper.hinge upper.whisker out
## 1    a             2        3.25    5.0        9.00            10 101
## 2    b             1        5.50    7.5        8.75             9 100

ggplot(box.df, aes(x = fact, y = out, middle = median,
                   ymin = lower.whisker, ymax = upper.whisker,
                   lower = lower.hinge, upper = upper.hinge)) +
  geom_boxplot(stat = "identity") + 
  geom_point()
like image 44
shadow Avatar answered Sep 30 '22 00:09

shadow