
ggplot2 boxplot medians aren't plotting as expected

So, I have a fairly large dataset (Dropbox: csv file) that I'm trying to plot using geom_boxplot. The following produces what appears to be a reasonable plot:

require(reshape2)
require(ggplot2)
require(scales)
require(grid)
require(gridExtra)

df <- read.csv("\\Downloads\\boxplot.csv", na.strings = "*")
df$year <- factor(df$year, levels = c(2010,2011,2012,2013,2014), labels = c(2010,2011,2012,2013,2014))

d <- ggplot(data = df, aes(x = year, y = value)) +
    geom_boxplot(aes(fill = station)) + 
    facet_grid(station~.) +
    scale_y_continuous(limits = c(0, 15)) + 
    theme(legend.position = "none")
d

However, when you dig a little deeper, problems creep in that freak me out. When I labeled the boxplot medians with their values, the following plot results.

df.m <- aggregate(value~year+station, data = df, FUN = function(x) median(x))
d <- d + geom_text(data = df.m, aes(x = year, y = value, label = value)) 
d

[Image: boxplots with medians labelled]

The median lines drawn by geom_boxplot aren't at the medians at all. The labels are plotted at the correct y-axis values, but the middle line of each box is definitely not at the median. I've been stumped by this for a few days now.

What is the reason for this? How can this type of display be produced with correct medians? How can this plot be debugged or diagnosed?

Ryan Pugh asked Mar 27 '15 15:03



1 Answer

The cause of the problem is the use of scale_y_continuous. ggplot2 performs operations in the following order:

  1. Scale Transformations
  2. Statistical Computations
  3. Coordinate Transformations

In this case, because limits are set on the scale, ggplot2 discards data outside those limits before the boxplot statistics are computed (with the default out-of-bounds handling, such values become NA). The medians calculated by aggregate and used in geom_text are based on the entire dataset, however. As a result, the median lines of the boxes and the text labels can disagree.
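To make the effect concrete, here is a minimal sketch in base R. The `values` vector is made-up toy data, not the dataset from the question; it only illustrates how dropping values above the limit of 15 pulls the median down.

```r
# Hypothetical toy data: several values lie above the upper limit of 15.
values <- c(2, 4, 6, 8, 10, 20, 30, 40, 50)

# Median over the full data, which is what aggregate() computes for the labels.
median_all <- median(values)

# Median after discarding values outside c(0, 15), mimicking what happens
# to the data before the boxplot statistics are computed when limits are set.
inside <- values[values >= 0 & values <= 15]
median_limited <- median(inside)

median_all      # 10
median_limited  # 6
```

The label would be placed at 10, while the box's median line would be drawn at 6.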

The solution is to omit the scale_y_continuous instruction and instead use:

d <- ggplot(data = df, aes(x = year, y = value)) +
    geom_boxplot(aes(fill = station)) + 
    facet_grid(station~.) +
    theme(legend.position = "none") +
    coord_cartesian(ylim = c(0, 15))

This allows ggplot2 to calculate the boxplot statistics (hinges and median) using the entire dataset, while restricting only the visible range of the y-axis.
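As for diagnosing this kind of discrepancy, one option is to inspect the values ggplot2 actually computed for the boxplot layer. The sketch below uses layer_data() on a made-up `toy` data frame (an assumption for illustration, standing in for the csv from the question) and compares the `middle` column, which holds the median line of each box, under both approaches.

```r
library(ggplot2)

# Hypothetical toy data standing in for the question's csv file.
toy <- data.frame(
  year  = factor(rep(c(2010, 2011), each = 9)),
  value = rep(c(2, 4, 6, 8, 10, 20, 30, 40, 50), times = 2)
)

# Limits set on the scale: out-of-range values are dropped before the stat.
p_limits <- ggplot(toy, aes(x = year, y = value)) +
  geom_boxplot() +
  scale_y_continuous(limits = c(0, 15))

# Limits set on the coordinate system: the stat sees all of the data.
p_zoom <- ggplot(toy, aes(x = year, y = value)) +
  geom_boxplot() +
  coord_cartesian(ylim = c(0, 15))

# layer_data() returns the data computed for a layer; for geom_boxplot the
# "middle" column is the median line that gets drawn. suppressWarnings()
# hides the warning about removed rows that the scale limits trigger.
mid_limits <- suppressWarnings(layer_data(p_limits, 1)$middle)
mid_zoom   <- layer_data(p_zoom, 1)$middle

# Medians computed directly from the data, for comparison.
true_medians <- aggregate(value ~ year, data = toy, FUN = median)$value
```

If `mid_zoom` matches `true_medians` while `mid_limits` does not, the scale limits are what is shifting the median lines.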

Ryan Pugh answered Oct 03 '22 11:10