Differing quantiles: Boxplot vs. Violinplot

Tags: r, ggplot2, boxplot, violin-plot, quantile

require(ggplot2)
require(cowplot)
d = iris

ggplot2::ggplot(d, aes(factor(0), Sepal.Length)) + 
    geom_violin(fill="black", alpha=0.2, draw_quantiles = c(0.25, 0.5, 0.75)
                , colour = "red", size = 1.5) +
    stat_boxplot(geom ='errorbar', width = 0.1)+
    geom_boxplot(width = 0.2)+
    facet_grid(. ~ Species, scales = "free_x") +
    xlab("") + 
    ylab (expression(paste("Value"))) +
    coord_cartesian(ylim = c(3.5,9.5)) + 
    scale_y_continuous(breaks = seq(4, 9, 1)) + 
    theme(axis.text.x=element_blank(),
          axis.text.y = element_text(size = rel(1.5)),
          axis.ticks.x = element_blank(),
          strip.background=element_rect(fill="black"),
          strip.text=element_text(color="white", face="bold"),
          legend.position = "none") +
    background_grid(major = "xy", minor = "none")

[Figure: boxplot vs. violinplot, https://i.stack.imgur.com/cJItC.png]

To my knowledge, the ends of the box in a boxplot represent the 25% and 75% quantiles, and the line inside the box is the median (the 50% quantile). They should therefore match the 0.25/0.5/0.75 quantiles that geom_violin draws when given the argument draw_quantiles = c(0.25, 0.5, 0.75).

The median and the 50% quantile agree. However, neither the 0.25 nor the 0.75 quantile lines up with the box ends of the boxplot (see figure, especially the 'virginica' facet).

References:

http://docs.ggplot2.org/current/geom_violin.html
http://docs.ggplot2.org/current/geom_boxplot.html

asked Mar 16 '16 10:03

pat-s

2 Answers

This is too long for a comment, so I am posting it as an answer. I see two potential sources for the divergence. First, my understanding is that boxplot relies on boxplot.stats, which uses hinges that are very similar, but not necessarily identical, to the quantiles. ?boxplot.stats says:

The two ‘hinges’ are versions of the first and third quartile, i.e., close to quantile(x, c(1,3)/4). The hinges equal the quartiles for odd n (where n <- length(x)) and differ for even n. Whereas the quartiles only equal observations for n %% 4 == 1 (n = 1 mod 4), the hinges do so additionally for n %% 4 == 2 (n = 2 mod 4), and are in the middle of two observations otherwise.

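The hinge vs. quantile distinction could thus be one source of the difference. Note, however, that the geom_boxplot documentation describes its hinges as the first and third quartiles and says this differs slightly from the method used by boxplot(), so this source may matter less for the ggplot2 plot.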
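To make the hinge vs. quantile distinction concrete, here is a minimal sketch. It assumes that the hinges reported by boxplot.stats are the ones returned by stats::fivenum, and it compares them with the default (type 7) result of stats::quantile. The toy vector and the object names are made up for illustration.

```r
# Toy vector of even length, where hinges and quartiles can differ
x <- c(1, 2, 3, 4)
hinges    <- stats::fivenum(x)[c(2, 4)]                 # Tukey hinges
quartiles <- unname(stats::quantile(x, c(0.25, 0.75)))  # default type 7
print(rbind(hinges, quartiles))

# Same comparison for each iris species (n = 50 per group)
hinge_vs_quartile <- t(sapply(
  split(iris$Sepal.Length, iris$Species),
  function(v) {
    c(hinge    = stats::fivenum(v)[c(2, 4)],
      quartile = unname(stats::quantile(v, c(0.25, 0.75))))
  }
))
print(hinge_vs_quartile)
```

Any gap between the hinge and quartile columns is the size of this first effect; it can then be compared with the gap visible in the plot.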

Second, geom_violin is based on a density estimate. Its source code points to StatYdensity, which in turn calls a function compute_density. I could not find that function, but I think (also due to some pointers in the help files) it is essentially density, which by default uses a Gaussian kernel. This may (or may not) explain the differences, but

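# Five-number summary behind base R boxplots (hinges at positions 2 and 4)
by(d$Sepal.Length, d$Species, function(x) boxplot.stats(x, coef=5)$stats )
# Quartiles of the x grid on which density() evaluates the kernel estimate
by(d$Sepal.Length, d$Species, function(v) quantile(density(v)$x))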

do indeed show differing values. So I would guess that the difference comes down to whether we look at quantiles based on the empirical distribution function of the observations or at quantiles based on a kernel density estimate, though I admit that I have not conclusively shown this.
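One caveat on the check above: quantile(density(v)$x) summarises the grid of x values on which the density is evaluated, not the estimated density itself. A sketch that reads the quartiles off the kernel estimate could look like the following. The helper kde_quantiles is hypothetical, and the choice to restrict the estimate to the data range and to invert a normalised running sum is an assumption about what the violin draws, not code taken from ggplot2.

```r
probs <- c(0.25, 0.5, 0.75)

# Hypothetical helper: quantiles of a Gaussian kernel density estimate.
# The estimate is restricted to the observed range (an assumption), its
# values are accumulated into an approximate CDF, and that CDF is inverted.
kde_quantiles <- function(v, probs) {
  k   <- stats::density(v, from = min(v), to = max(v))
  cdf <- cumsum(k$y) / sum(k$y)
  stats::approx(cdf, k$x, xout = probs, ties = "ordered")$y
}

# Sample quartiles next to density-based quartiles, one row per species
quartile_table <- t(sapply(
  split(iris$Sepal.Length, iris$Species),
  function(v) {
    c(sample = unname(stats::quantile(v, probs)),
      kde    = kde_quantiles(v, probs))
  }
))
print(round(quartile_table, 2))
```

If the kde columns differ from the sample columns in the same direction as the red lines differ from the box ends, that supports the second explanation.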

answered Sep 29 '22 14:09

coffeinjunky

The second factor that @coffeinjunky raised seems to be the main cause. Here is some more evidence to bolster that.

By switching to stat_ydensity, one can empirically confirm that the difference arises because geom_violin computes its quantiles from the kernel density estimate rather than from the actual observations. For example, if we force a wide bandwidth (bw=1), the estimated densities are over-smoothed and deviate further from the observation-based quantiles used in the boxplots:

require(ggplot2)
require(cowplot)

theme_set(cowplot::theme_cowplot())

d = iris

ggplot2::ggplot(d, aes(factor(0), Sepal.Length)) + 
  stat_ydensity(bw=1, fill="black", alpha=0.2, draw_quantiles = c(0.25, 0.5, 0.75)
              , colour = "red", size = 1.5) +
  stat_boxplot(geom ='errorbar', width = 0.1)+
  geom_boxplot(width = 0.2)+
  facet_grid(. ~ Species, scales = "free_x") +
  xlab("") + 
  ylab (expression(paste("Value"))) +
  coord_cartesian(ylim = c(3.5,9.5)) + 
  scale_y_continuous(breaks = seq(4, 9, 1)) + 
  theme(axis.text.x=element_blank(),
        axis.text.y = element_text(size = rel(1.5)),
        axis.ticks.x = element_blank(),
        strip.background=element_rect(fill="black"),
        strip.text=element_text(color="white", face="bold"),
        legend.position = "none") +
  background_grid(major = "xy", minor = "none")

So, yes, be careful with this one - the parameters of the density estimation can impact the results!
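The bandwidth effect can also be checked numerically, without plotting. The sketch below is only an approximation of the violin's quantile lines: kde_quartiles is a hypothetical helper, the bandwidth values are arbitrary, and restricting the estimate to the data range is an assumption.

```r
v     <- iris$Sepal.Length[iris$Species == "virginica"]
probs <- c(0.25, 0.75)

# Hypothetical helper: lower and upper quartile of a Gaussian kernel
# density estimate with bandwidth bw, restricted to the observed range.
kde_quartiles <- function(v, bw) {
  k   <- stats::density(v, bw = bw, from = min(v), to = max(v))
  cdf <- cumsum(k$y) / sum(k$y)
  stats::approx(cdf, k$x, xout = probs, ties = "ordered")$y
}

bandwidths <- c(0.1, 0.25, 0.5, 1)
res <- t(sapply(bandwidths, kde_quartiles, v = v))
dimnames(res) <- list(paste0("bw=", bandwidths), c("q25", "q75"))

print(round(res, 2))              # density-based quartiles per bandwidth
print(stats::quantile(v, probs))  # observation-based quartiles
```

Comparing the rows with the observation-based quartiles shows how far the density-based values drift as the bandwidth grows.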

answered Sep 29 '22 15:09

merv
