So, I have a fairly large dataset (Dropbox: csv file) that I'm trying to plot using geom_boxplot. The following produces what appears to be a reasonable plot:
require(reshape2)
require(ggplot2)
require(scales)
require(grid)
require(gridExtra)
df <- read.csv("\\Downloads\\boxplot.csv", na.strings = "*")
df$year <- factor(df$year, levels = c(2010,2011,2012,2013,2014), labels = c(2010,2011,2012,2013,2014))
d <- ggplot(data = df, aes(x = year, y = value)) +
  geom_boxplot(aes(fill = station)) +
  facet_grid(station~.) +
  scale_y_continuous(limits = c(0, 15)) +
  theme(legend.position = "none")
d
However, when you dig a little deeper, problems creep in that freak me out. When I label the boxplot medians with their values, the following plot results:
df.m <- aggregate(value~year+station, data = df, FUN = median)
d <- d + geom_text(data = df.m, aes(x = year, y = value, label = value))
d
The medians drawn by geom_boxplot aren't at the medians at all. The labels are plotted at the correct y-axis values, but the middle lines of the boxplots are definitely not at the medians. I've been stumped by this for a few days now.
What is the reason for this? How can this type of display be produced with correct medians? How can this plot be debugged or diagnosed?
The solution to this question lies in the use of scale_y_continuous. ggplot2 performs operations in the following order: scale transformations, then statistical computations, then coordinate transformations.
In this case, because a scale transformation is invoked (limits are set on the y scale), ggplot2 excludes data outside the scale limits before the statistical computation of the boxplot hinges; by default, out-of-range values are replaced with NA. The medians calculated by the aggregate function and used in the geom_text layer are based on the entire dataset, however. This can make the boxplot median lines disagree with the text labels.
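To see why the two medians can differ, here is a minimal base R sketch. The numbers are made up for illustration (they are not from the CSV in the question), and the ifelse() step only mimics what the scale limits do to out-of-range values; it is not ggplot2's internal code.

```r
# Hypothetical values: seven fall inside the limits c(0, 15), three fall outside
value <- c(2, 4, 6, 8, 10, 12, 14, 20, 40, 80)

# Median of the full data, as aggregate(..., FUN = median) would compute it
full_median <- median(value)

# Mimic the scale limits: values outside [0, 15] become NA and are then dropped
censored <- ifelse(value >= 0 & value <= 15, value, NA)
censored_median <- median(censored, na.rm = TRUE)

full_median      # 11
censored_median  # 8
```

The text labels correspond to the first number, while the boxplot drawn with scale limits corresponds to the second.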
The solution is to omit the scale_y_continuous instruction and instead use:
d <- ggplot(data = df, aes(x = year, y = value)) +
  geom_boxplot(aes(fill = station)) +
  facet_grid(station~.) +
  theme(legend.position = "none") +
  coord_cartesian(ylim = c(0, 15))
This allows ggplot2 to calculate the boxplot hinge stats using the entire dataset, while limiting only the visible y-axis range of the figure.
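As for how to diagnose this kind of thing: one option is to pull the computed statistics out of the plot with layer_data() and compare the "middle" column against the medians from aggregate(). The sketch below uses a small made-up data frame (df_demo is a hypothetical stand-in, not the original CSV) and drops the station facet to keep it short.

```r
library(ggplot2)

# Hypothetical stand-in data: some values in each year exceed the limit of 15
df_demo <- data.frame(
  year  = factor(rep(c(2010, 2011), each = 10)),
  value = c(2, 4, 6, 8, 10, 12, 14, 20, 40, 80,
            1, 3, 5, 7, 9, 11, 13, 30, 50, 70)
)

# Version 1: limits set on the scale (out-of-range rows are dropped before the stat)
p_limits <- ggplot(df_demo, aes(x = year, y = value)) +
  geom_boxplot() +
  scale_y_continuous(limits = c(0, 15))

# Version 2: zoom with the coordinate system (the stat sees all rows)
p_coord <- ggplot(df_demo, aes(x = year, y = value)) +
  geom_boxplot() +
  coord_cartesian(ylim = c(0, 15))

# "middle" is the median line that geom_boxplot actually draws.
# The first call is expected to warn about removed rows; that warning is itself a clue.
middle_limits <- as.numeric(layer_data(p_limits, 1)$middle)
middle_coord  <- as.numeric(layer_data(p_coord, 1)$middle)

# Medians of the full data, as used for the text labels
true_medians <- aggregate(value ~ year, data = df_demo, FUN = median)$value

data.frame(year = levels(df_demo$year), true_medians, middle_limits, middle_coord)
```

With the coord_cartesian version the middle column should match the aggregate medians; with the scale limits version it should not, whenever data fall outside the limits.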