I'd like a box plot that looks just like the one below, but instead of the default I'd like it to (1) show 95% confidence intervals and (2) leave out the outliers.
The 95% confidence intervals could mean (i) extending the boxes and removing the whiskers, or (ii) having just a mean and whiskers and removing the boxes. If people have other ideas for presenting 95% confidence intervals in a plot like this, I'm open to suggestions. The final goal is to show the mean and confidence intervals for data across multiple categories on the same plot.
library(ggplot2)

set.seed(1234)
df <- data.frame(cond = factor(rep(c("A", "B"), each = 200)),
                 rating = c(rnorm(200), rnorm(200, mean = .8)))
ggplot(df, aes(x = cond, y = rating, fill = cond)) + geom_boxplot() +
  guides(fill = "none") + coord_flip()
Image and code source: http://www.cookbook-r.com/Graphs/Plotting_distributions_(ggplot2)/
A notched boxplot shows a confidence interval (approximately 95% by default) around the median of each box.
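As a rough sketch of the notched variant, using the question's simulated data: the ggplot2 documentation describes the notches as extending 1.58 * IQR / sqrt(n) around the median, which gives a roughly 95% interval for comparing medians. Note that this is an interval for the median, not the mean. The name p_notch is only illustrative, and outlier.shape = NA is used here to suppress the outlier points.

```r
library(ggplot2)

# Same simulated data as in the question
set.seed(1234)
df <- data.frame(
  cond   = factor(rep(c("A", "B"), each = 200)),
  rating = c(rnorm(200), rnorm(200, mean = 0.8))
)

# notch = TRUE draws a notch around each median;
# outlier.shape = NA keeps the outlier points from being drawn
p_notch <- ggplot(df, aes(x = cond, y = rating, fill = cond)) +
  geom_boxplot(notch = TRUE, outlier.shape = NA) +
  guides(fill = "none") +
  coord_flip()

p_notch
```

If the notches of two boxes do not overlap, that is usually read as evidence that the medians differ.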
Notch in box plots is 95% confidence interval for median; whiskers exclude outliers. Horizontal black lines are global medians; green line in ASR/N plot highlights ASR = N.
I see no "versus" here. Box plots show the entire distribution, summarized. You say you find them helpful. Confidence intervals arise when your concern is to estimate some parameter, say the mean of a variable, but quite possibly something else.
I've used the following to show a 95% interval. Based on what I've read it's not an uncommon use of box and whisker, but it's not the default, so you do need to make it clear what you're showing in the graph.
quantiles_95 <- function(x) {
  r <- quantile(x, probs = c(0.05, 0.25, 0.5, 0.75, 0.95))
  names(r) <- c("ymin", "lower", "middle", "upper", "ymax")
  r
}
ggplot(df, aes(x = cond, y = rating, fill = cond)) +
  guides(fill = "none") +
  coord_flip() +
  stat_summary(fun.data = quantiles_95, geom = "boxplot")
Instead of using geom_boxplot, use stat_summary with a custom function that specifies the limits you want to use:
"ymin" is the lower limit of the lower whisker
"lower" is the lower edge of the box
"middle" is the middle of the box (typically the median)
"upper" is the upper edge of the box
"ymax" is the upper limit of the upper whisker
In the provided function (quantiles_95), the built-in quantile function is used with a custom probs argument. As given, the whiskers will span 90% of your data: from the 5th percentile to the 95th percentile. The box will span the middle two quartiles, as usual, from 25% to 75%.
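To see what the helper hands to the boxplot geom, it can be called directly on a small vector. The input 0:100 is just a toy example chosen so that the percentiles are easy to read off; it is not part of the original answer.

```r
# Same helper as in the answer, repeated so this snippet runs on its own
quantiles_95 <- function(x) {
  r <- quantile(x, probs = c(0.05, 0.25, 0.5, 0.75, 0.95))
  names(r) <- c("ymin", "lower", "middle", "upper", "ymax")
  r
}

# Toy input (an assumption for illustration): the integers 0 to 100
q <- quantiles_95(0:100)
q
# ymin = 5, lower = 25, middle = 50, upper = 75, ymax = 95
```

The names on the returned vector are what matter: stat_summary passes them to the boxplot geom as the whisker ends, box edges, and middle line.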
You can always change the custom function to choose different quantiles (or even to not use quantiles), but you need to be very careful with this. As pointed out in a comment, there is a certain expectation when one sees a box and whisker plot. If you're using the same shape plot to convey different information, you're likely to confuse people.
If you want to get rid of the whiskers, make "ymin" equal to "lower" and "ymax" equal to "upper". If you want to have all whiskers and no box, set "upper" and "lower" both equal to "middle" (or just use geom_errorbar).
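For the "mean and whiskers, no box" option, one possible sketch is an error bar layer fed by a summary function that returns the mean and a t-based 95% confidence interval. The helper mean_ci_95 and the object p_ci are hypothetical names introduced here, not part of the original answer, and the interval assumes the usual t approximation for the mean is reasonable for your data.

```r
library(ggplot2)

# Same simulated data as in the question
set.seed(1234)
df <- data.frame(
  cond   = factor(rep(c("A", "B"), each = 200)),
  rating = c(rnorm(200), rnorm(200, mean = 0.8))
)

# Hypothetical helper: mean with a t-based 95% confidence interval,
# returned with the column names stat_summary expects (y, ymin, ymax)
mean_ci_95 <- function(x) {
  n  <- length(x)
  m  <- mean(x)
  hw <- qt(0.975, df = n - 1) * sd(x) / sqrt(n)  # half-width of the interval
  data.frame(y = m, ymin = m - hw, ymax = m + hw)
}

p_ci <- ggplot(df, aes(x = cond, y = rating, colour = cond)) +
  stat_summary(fun.data = mean_ci_95, geom = "errorbar", width = 0.2) +  # the interval
  stat_summary(fun.data = mean_ci_95, geom = "point", size = 3) +        # the mean
  guides(colour = "none") +
  coord_flip()

p_ci
```

Because this no longer looks like a box plot, readers are less likely to misread the whiskers as quartiles or ranges, but the caption should still say that the bars are 95% confidence intervals for the mean.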
You can hide the outliers by setting outlier.shape = NA (setting outlier.size = 0 can still draw small points in newer ggplot2 versions):
ggplot(df, aes(x = cond, y = rating, fill = cond)) +
  geom_boxplot(outlier.shape = NA) +
  guides(fill = "none") + coord_flip()
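You can add the mean to the plot with the stat_summary function:
ggplot(df, aes(x = cond, y = rating, fill = cond)) +
  geom_boxplot(outlier.shape = NA) +  # outlier.shape = NA hides the outlier points
  # fun replaces the deprecated fun.y argument (ggplot2 >= 3.3.0)
  stat_summary(fun = mean, geom = "point", shape = 23, size = 4, fill = "white") +
  guides(fill = "none") +
  coord_flip()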