
lower and upper quartiles in boxplot in R

Tags: plot, r, boxplot

I have

X=c(20 ,18, 34, 45, 30, 51, 63, 52, 29, 36, 27, 24)

With boxplot, I'm trying to plot quantile(X,0.25) and quantile(X,0.75), but these are not really the same as the lower and upper quartiles drawn by boxplot in R:

boxplot(X)
abline(h=quantile(X,0.25),col="red",lty=2)
abline(h=quantile(X,0.75),col="red",lty=2)

Do you know why?

asked Nov 16 '16 14:11 by Math


2 Answers

The edges of the box are called hinges. They may coincide with the quartiles (as calculated by quantile(x, c(0.25, 0.75))), but they are calculated differently.

From ?boxplot.stats:

The two ‘hinges’ are versions of the first and third quartile, i.e., close to quantile(x, c(1,3)/4). The hinges equal the quartiles for odd n (where n <- length(x)) and differ for even n. Whereas the quartiles only equal observations for n %% 4 == 1 (n = 1 mod 4), the hinges do so additionally for n %% 4 == 2 (n = 2 mod 4), and are in the middle of two observations otherwise.

To see that the values coincide with an odd number of observations, try the following code:

set.seed(1234)
x <- rnorm(9)

boxplot(x)
abline(h=quantile(x, c(0.25, 0.75)), col="red")
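For the even-length data in the question, one way to get lines that sit exactly on the box edges is to read the hinges from boxplot.stats rather than from quantile. This is only a sketch; the variable names hinges and quartiles are chosen here for illustration.

```r
# Data from the question
X <- c(20, 18, 34, 45, 30, 51, 63, 52, 29, 36, 27, 24)

# boxplot.stats() returns the five numbers that boxplot() draws:
# lower whisker, lower hinge, median, upper hinge, upper whisker
hinges <- boxplot.stats(X)$stats[c(2, 4)]
print(hinges)

# Default quartiles from quantile(), for comparison
quartiles <- unname(quantile(X, c(0.25, 0.75)))
print(quartiles)

# In an interactive session the hinges can be overlaid on the box with:
#   boxplot(X); abline(h = hinges, col = "red", lty = 2)
```

The two printed pairs differ because this X has an even number of observations.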


answered Oct 03 '22 09:10 by lmo


The discrepancy arises from an ambiguity in the definition of quantiles. No single method is strictly correct or incorrect; there are simply different ways to estimate quantiles in situations (such as an even number of data points) when they do not neatly coincide with a specific data point and must be interpolated. Somewhat disconcertingly, boxplot and quantile (and other functions that provide summary statistics) use different default methods to calculate quantiles, although the default in quantile can be overridden using its type = argument.

We can see these differences more clearly in action by looking at some of the various ways to generate quantile statistics in R.

Both boxplot and fivenum give the same values:

boxplot.stats(X)$stats
# [1] 18.0 25.5 32.0 48.0 63.0
fivenum(X)
# [1] 18.0 25.5 32.0 48.0 63.0

In boxplot and fivenum, the lower (upper) hinge is equivalent to the median of the lower (upper) half of the data, where each half includes the overall median whenever that median is itself a data point:

c(median(X[ X <= median(X) ]), median(X[ X >= median(X) ]))
# [1] 25.5  48.0
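The same rule can be written in terms of positions in the sorted data. The sketch below is my reading of the depth-based rule that fivenum applies; tukey_hinges is a hypothetical helper, not a base R function, and it ignores missing values and empty input.

```r
# Hypothetical helper (not part of base R): hinges from positions in sorted data
tukey_hinges <- function(x) {
  x <- sort(x)
  n <- length(x)
  depth <- floor((n + 3) / 2) / 2   # position of the lower hinge, may end in .5
  pos <- c(depth, n + 1 - depth)    # lower and upper hinge positions
  # A position ending in .5 averages the two neighbouring sorted values
  0.5 * (x[floor(pos)] + x[ceiling(pos)])
}

X <- c(20, 18, 34, 45, 30, 51, 63, 52, 29, 36, 27, 24)
tukey_hinges(X)       # lower and upper hinge
fivenum(X)[c(2, 4)]   # the same two values from base R
```

For n = 12 the positions are 3.5 and 9.5, so each hinge is the mean of two neighbouring sorted values (24 and 27, 45 and 51).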

But quantile and summary do things differently:

summary(X)
#  Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
# 18.00   26.25   32.00   35.75   46.50   63.00

quantile(X, c(0.25,0.5,0.75))
#   25%   50%   75% 
# 26.25 32.00 46.50

The difference between these results and those from boxplot and fivenum hinges on how the functions interpolate between data points. quantile attempts to interpolate by estimating the shape of the cumulative distribution function. According to ?quantile:

quantile returns estimates of underlying distribution quantiles based on one or two order statistics from the supplied elements in x at probabilities in probs. One of the nine quantile algorithms discussed in Hyndman and Fan (1996), selected by type, is employed.

The full details of the nine different methods quantile employs to estimate the distribution function of the data can be found in ?quantile, and are too lengthy to reproduce in full here. The important point to note is that the 9 methods are taken from Hyndman and Fan (1996) who recommended type 8. The default method used by quantile is type 7, for historical reasons of compatibility with S. We can see the estimates of the quartiles provided by different methods in quantile using:

# One row per quantile type (1 to 9), one column per quartile
quantile_methods = data.frame(q25 = sapply(1:9, function(method) quantile(X, 0.25, type = method)),
           q50 = sapply(1:9, function(method) quantile(X, 0.50, type = method)),
           q75 = sapply(1:9, function(method) quantile(X, 0.75, type = method)))
quantile_methods
#       q25 q50    q75
# 1 24.0000  30 45.000
# 2 25.5000  32 48.000
# 3 24.0000  30 45.000
# 4 24.0000  30 45.000
# 5 25.5000  32 48.000
# 6 24.7500  32 49.500
# 7 26.2500  32 46.500
# 8 25.2500  32 48.500
# 9 25.3125  32 48.375

Here, type = 5 gives the same estimates of the quartiles as boxplot does (and, for this data, so does type = 2). However, when there is an odd number of data points, it is type = 7 that coincides with the boxplot stats.
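The lower-quartile values 25.5 (type 5) and 26.25 (type 7) in the table can be reproduced by hand. The position formulas below, n * p + 0.5 for type 5 and (n - 1) * p + 1 for type 7, are my summary of the definitions in ?quantile and apply away from the extremes of the data; interp_at is a hypothetical helper written for this illustration.

```r
X <- c(20, 18, 34, 45, 30, 51, 63, 52, 29, 36, 27, 24)
xs <- sort(X)
n <- length(xs)
p <- 0.25

# Hypothetical helper: linear interpolation at a fractional position h
interp_at <- function(xs, h) {
  lo <- floor(h)
  hi <- ceiling(h)
  xs[lo] + (h - lo) * (xs[hi] - xs[lo])
}

h5 <- n * p + 0.5       # type 5 position: 3.5
h7 <- (n - 1) * p + 1   # type 7 position: 3.75

interp_at(xs, h5)   # 25.5, halfway between 24 and 27
interp_at(xs, h7)   # 26.25, three quarters of the way from 24 to 27

# Compare with the built-in results
quantile(X, p, type = 5)
quantile(X, p, type = 7)
```

Both types interpolate between the same two sorted values (24 and 27); they differ only in how far along that gap the 25% point is placed.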

We can show that this works by automatically selecting the type to be either 5 or 7, depending on whether there is an even or odd number of data points. The code below draws boxplots for data sets with 1 to 30 values and overlays the quantiles, with boxplot and quantile giving the same values for both odd and even N:

# 5 x 6 grid of small panels, one per sample size
layout(matrix(1:30, 5, 6, byrow = TRUE), respect = TRUE)
par(mar = c(0.2, 0.2, 0.2, 0.2), bty = "n", yaxt = "n", xaxt = "n")

for (N in 1:30) {
  X = sample(100, N)
  boxplot(X)
  # type 5 for even N, type 7 for odd N
  abline(h = quantile(X, c(0.25, 0.5, 0.75), type = c(5, 7)[(N %% 2) + 1]), col = "red", lty = 2)
}
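The same agreement can be checked numerically instead of by eye. This is a sketch under the same setup as the loop above; the seed is arbitrary and only makes the check repeatable, and matches is a name chosen here for illustration.

```r
set.seed(1)  # arbitrary seed, only so that the check is repeatable

matches <- sapply(1:30, function(N) {
  x <- sample(100, N)
  type <- c(5, 7)[(N %% 2) + 1]   # type 5 for even N, type 7 for odd N
  q <- quantile(x, c(0.25, 0.5, 0.75), type = type)
  # fivenum()[2:4] holds the lower hinge, median and upper hinge drawn by boxplot
  isTRUE(all.equal(unname(q), fivenum(x)[2:4]))
})

all(matches)
```

If the type is chosen by parity as described, every element of matches should be TRUE.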



Hyndman, R. J. and Fan, Y. (1996) Sample quantiles in statistical packages, American Statistician 50, 361–365

answered Oct 03 '22 09:10 by dww