library(ggplot2)
library(cowplot)  # provides background_grid()

d <- iris

# One violin per species, overlaid with a boxplot and whisker caps
ggplot(d, aes(factor(0), Sepal.Length)) +
  geom_violin(fill = "black", alpha = 0.2,
              draw_quantiles = c(0.25, 0.5, 0.75),
              colour = "red", size = 1.5) +
  stat_boxplot(geom = "errorbar", width = 0.1) +
  geom_boxplot(width = 0.2) +
  facet_grid(. ~ Species, scales = "free_x") +
  xlab("") +
  ylab("Value") +
  coord_cartesian(ylim = c(3.5, 9.5)) +
  scale_y_continuous(breaks = seq(4, 9, 1)) +
  theme(axis.text.x = element_blank(),
        axis.text.y = element_text(size = rel(1.5)),
        axis.ticks.x = element_blank(),
        strip.background = element_rect(fill = "black"),
        strip.text = element_text(color = "white", face = "bold"),
        legend.position = "none") +
  background_grid(major = "xy", minor = "none")
To my knowledge, the box ends in a boxplot represent the 25% and 75% quantiles, and the line inside the box represents the median (50%). So they should coincide with the 0.25/0.5/0.75 quantiles that geom_violin draws via the draw_quantiles = c(0.25, 0.5, 0.75) argument.
The median and the 50% quantile do coincide. However, neither the 0.25 nor the 0.75 quantile lines up with the corresponding box end (see figure, especially the 'virginica' facet).
References:
http://docs.ggplot2.org/current/geom_violin.html
http://docs.ggplot2.org/current/geom_boxplot.html
A violin plot is more informative than a plain box plot. While a box plot shows only summary statistics such as the median and the interquartile range, a violin plot shows both summary statistics and the estimated density of the data. The difference is particularly useful when the data distribution is multimodal (more than one peak).
Violin plots are used when you want to observe the distribution of numeric data, and are especially useful when you want to make a comparison of distributions between multiple groups. The peaks, valleys, and tails of each group's density curve can be compared to see where groups are similar or different.
Only the y values of the points are visualized in the violin plot. The width of the violin at a given y value represents the point density at that y value. Technically, a violin plot is a density estimate rotated by 90 degrees and then mirrored. Violins are therefore symmetric.
This is too long for a comment, so I post it as an answer. I see two potential sources for the divergence. First, my understanding is that the boxplot refers to boxplot.stats, which uses hinges that are very similar but not necessarily identical to the quantiles. ?boxplot.stats says:
The two ‘hinges’ are versions of the first and third quartile, i.e., close to quantile(x, c(1,3)/4). The hinges equal the quartiles for odd n (where n <- length(x)) and differ for even n. Whereas the quartiles only equal observations for n %% 4 == 1 (n = 1 mod 4), the hinges do so additionally for n %% 4 == 2 (n = 2 mod 4), and are in the middle of two observations otherwise.
The hinge vs quantile distinction could thus be one source for the difference.
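To make the hinge vs quartile distinction concrete, here is a minimal base-R sketch on two made-up vectors (`x_even` and `x_odd` are illustrative, not taken from the question). One caveat: as far as I can tell from the geom_boxplot help page, ggplot2 draws the box at ordinary quartiles rather than at the boxplot.stats hinges, so this effect may matter more for base boxplot() than for the plot above.

```r
# Illustrative vectors (not from the question): one even-length, one odd-length.
x_even <- 1:10
x_odd  <- 1:9

# Hinges as returned by boxplot.stats() (elements 2 and 4 of $stats).
hinges_even <- boxplot.stats(x_even)$stats[c(2, 4)]
hinges_odd  <- boxplot.stats(x_odd)$stats[c(2, 4)]

# Default (type 7) sample quartiles.
quart_even <- unname(quantile(x_even, c(0.25, 0.75)))
quart_odd  <- unname(quantile(x_odd,  c(0.25, 0.75)))

print(rbind(hinges = hinges_even, quartiles = quart_even))  # differ for n = 10
print(rbind(hinges = hinges_odd,  quartiles = quart_odd))   # agree for n = 9
```

For n = 10 the hinges fall on observations (3 and 8) while the quartiles are interpolated (3.25 and 7.75); for n = 9 both give 3 and 7, in line with the help text quoted above.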
Second, geom_violin refers to a density estimate. The source code here points to a function StatYdensity, which leads me to here. I could not find the function compute_density, but I think (also due to some pointers in help files) it is essentially density, which by default uses a Gaussian kernel to estimate the density. This may (or may not) explain the differences, but
by(d$Sepal.Length, d$Species, function(x) boxplot.stats(x, coef=5)$stats )
by(d$Sepal.Length, d$Species, function(v) quantile(density(v)$x))
do indeed show differing values. So I would guess that the difference comes down to whether the quantiles are based on the empirical distribution function of the observations or on a kernel density estimate, though I admit that I have not conclusively shown this.
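A somewhat more direct check is to read the quartiles off the estimated density itself, by accumulating the density over a grid and inverting the resulting approximate CDF. The sketch below does that in base R. The helper `kde_quantiles` is hypothetical (it is not a ggplot2 function), and restricting the grid to the data range is my assumption about how a trimmed violin behaves, not something I have verified against the ggplot2 source.

```r
# Hypothetical helper: quantiles implied by a kernel density estimate,
# obtained by numerically accumulating the estimated density over a grid.
kde_quantiles <- function(x, probs = c(0.25, 0.5, 0.75), bw = "nrd0") {
  # Assumption: restrict the grid to the data range, like a trimmed violin.
  dens <- density(x, bw = bw, from = min(x), to = max(x), n = 512)
  cdf  <- cumsum(dens$y) / sum(dens$y)  # approximate CDF on the grid
  approx(cdf, dens$x, xout = probs, ties = "ordered", rule = 2)$y
}

# Per species: observation-based quartiles next to density-based ones.
compare <- sapply(split(iris$Sepal.Length, iris$Species), function(v) {
  c(empirical = unname(quantile(v, c(0.25, 0.75))),
    kde       = kde_quantiles(v, c(0.25, 0.75)))
})
print(round(compare, 2))
```

Rows `empirical1`/`empirical2` are the 25% and 75% sample quantiles and `kde1`/`kde2` the density-based counterparts, so any gap between the two approaches shows up column by column.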
The second factor that @coffeinjunky raised seems to be the main cause. Here is some more evidence to bolster that.
By switching to stat_ydensity, one can empirically confirm that the difference is due to geom_violin using the kernel density estimate, rather than the actual observations, to compute the quantiles. For example, if we force a wide bandwidth (bw=1), the estimated densities will be over-smoothed and deviate further from the observation-based quantiles used in the boxplots:
library(ggplot2)
library(cowplot)  # provides theme_cowplot() and background_grid()
theme_set(cowplot::theme_cowplot())

d <- iris

# Same plot as in the question, but with the bandwidth forced to bw = 1
ggplot(d, aes(factor(0), Sepal.Length)) +
  stat_ydensity(bw = 1, fill = "black", alpha = 0.2,
                draw_quantiles = c(0.25, 0.5, 0.75),
                colour = "red", size = 1.5) +
  stat_boxplot(geom = "errorbar", width = 0.1) +
  geom_boxplot(width = 0.2) +
  facet_grid(. ~ Species, scales = "free_x") +
  xlab("") +
  ylab("Value") +
  coord_cartesian(ylim = c(3.5, 9.5)) +
  scale_y_continuous(breaks = seq(4, 9, 1)) +
  theme(axis.text.x = element_blank(),
        axis.text.y = element_text(size = rel(1.5)),
        axis.ticks.x = element_blank(),
        strip.background = element_rect(fill = "black"),
        strip.text = element_text(color = "white", face = "bold"),
        legend.position = "none") +
  background_grid(major = "xy", minor = "none")
So, yes, be careful with this one - the parameters of the density estimation can impact the results!
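The bandwidth dependence can also be inspected numerically, without plotting. The following is a rough base-R sketch, not the ggplot2 implementation: `kde_quartiles_bw` is a hypothetical helper, the bandwidth values 0.1 and 1 are arbitrary picks for illustration, and limiting the grid to the observed range is an assumption on my part.

```r
# Hypothetical helper (not part of ggplot2): quartiles read off a kernel
# density estimate with bandwidth `bw`, restricted to the observed range.
kde_quartiles_bw <- function(x, bw) {
  dens <- density(x, bw = bw, from = min(x), to = max(x), n = 512)
  cdf  <- cumsum(dens$y) / sum(dens$y)  # approximate CDF on the grid
  approx(cdf, dens$x, xout = c(0.25, 0.75), ties = "ordered", rule = 2)$y
}

v <- iris$Sepal.Length[iris$Species == "virginica"]

# Bandwidths to try; bw.nrd0(v) is the default rule used by density().
bws <- c(default = bw.nrd0(v), narrow = 0.1, wide = 1)
by_bw <- t(sapply(bws, function(b) kde_quartiles_bw(v, b)))
colnames(by_bw) <- c("q25", "q75")

print(round(by_bw, 2))
print(quantile(v, c(0.25, 0.75)))  # observation-based quartiles for reference
```

Each row of `by_bw` gives the density-based 25% and 75% quantiles for one bandwidth, which can be set against the observation-based quartiles printed last.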