
Different number of outliers with ggplot2

Can somebody explain why I get a different number of outliers with the base R boxplot() command than with geom_boxplot() from ggplot2? Here is an example:

x <- c(280.9, 135.9, 321.4, 333.7, 0.2, 71.3, 33.0, 102.6, 126.8, 194.8, 35.5, 
       107.3, 45.1, 107.2, 55.2, 28.1, 36.9, 24.3, 68.7, 163.5, 0.8, 31.8, 121.4, 
       84.7, 34.3, 25.2, 101.4, 203.2, 194.1, 27.9, 42.5, 47.0, 85.1, 90.4, 103.8, 
       45.1, 94.0, 36.0, 60.9, 97.1, 42.5, 96.4, 58.4, 174.0, 173.2, 164.1, 92.1, 
       41.9, 130.2, 94.7, 121.5, 261.4, 46.7, 16.3, 50.7, 112.9, 112.2, 242.5, 140.6, 
       112.6, 31.2, 36.7, 97.4, 140.5, 123.5, 42.9, 59.4, 94.5, 37.4, 232.2, 114.6, 
       60.7, 27.8, 115.5, 111.9, 60.1)
data <- data.frame(x)
boxplot(data$x)

library(ggplot2)
ggplot(data, aes(y=x)) + geom_boxplot()

With the boxplot command I get a plot with 4 outliers.

With ggplot2 I get a plot with 5 outliers.

Alfredo Sánchez asked Dec 15 '18 16:12




1 Answer

geom_boxplot() and boxplot() use slightly different methods to calculate the box statistics. From ?geom_boxplot:

The lower and upper hinges correspond to the first and third quartiles (the 25th and 75th percentiles). This differs slightly from the method used by the boxplot() function, and may be apparent with small samples. See boxplot.stats() for more information on how hinge positions are calculated for boxplot().
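To see where the two definitions diverge for a given vector, here is a minimal base R sketch. count_outliers is a hypothetical helper name (it is not part of base R or ggplot2), and the sketch assumes geom_boxplot() takes its hinges from quantile() with the default type, while boxplot() takes them from fivenum() as boxplot.stats() does.

```r
# Hypothetical helper: count outliers under both hinge definitions.
count_outliers <- function(x, coef = 1.5) {
  x <- x[!is.na(x)]
  # boxplot(): hinges from fivenum(), as used by boxplot.stats()
  h <- fivenum(x)[c(2, 4)]
  # geom_boxplot() (assumed): hinges from quantile() with its default type
  q <- unname(quantile(x, c(0.25, 0.75)))
  # A point is an outlier when it lies strictly beyond hinge -/+ coef * spread
  n_out <- function(lo, hi) {
    fence <- coef * (hi - lo)
    sum(x < lo - fence | x > hi + fence)
  }
  data.frame(method = c("boxplot", "geom_boxplot"),
             lower  = c(h[1], q[1]),
             upper  = c(h[2], q[2]),
             n_out  = c(n_out(h[1], h[2]), n_out(q[1], q[2])))
}
```

Calling count_outliers(data$x) on the data from the question should give an upper hinge of 122.5 for boxplot() and 122.0 for geom_boxplot(). The upper fences are then 242.5 and 241.25, and the observation 242.5 lies strictly beyond only the second one, which accounts for the 4 versus 5 outliers.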

If you want the same results as boxplot(), you can get ggplot to use boxplot.stats():

# Function to use boxplot.stats to set the box-and-whisker locations  
f.bxp = function(x) {
  bxp = boxplot.stats(x)[["stats"]]
  names(bxp) = c("ymin","lower", "middle","upper","ymax")
  bxp
}  

# Function to use boxplot.stats for the outliers
f.out = function(x) {
  data.frame(y=boxplot.stats(x)[["out"]])
}

To use those functions in ggplot:

ggplot(data, aes(0, y=x)) + 
  stat_summary(fun.data=f.bxp, geom="boxplot") + 
  stat_summary(fun.data=f.out, geom="point")

If you want to replicate the statistics that ggplot uses natively, these are explained in ?geom_boxplot as follows:

ymin = lower whisker = smallest observation greater than or equal to lower hinge - 1.5 * IQR

lower = lower hinge, 25% quantile

notchlower = lower edge of notch = median - 1.58 * IQR / sqrt(n)

middle = median, 50% quantile

notchupper = upper edge of notch = median + 1.58 * IQR / sqrt(n)

upper = upper hinge, 75% quantile

ymax = upper whisker = largest observation less than or equal to upper hinge + 1.5 * IQR

We can calculate these accordingly:

y = sort(x)
iqr = quantile(y,0.75) - quantile(y,0.25)
ymin = y[which(y >= quantile(y,0.25) - 1.5*iqr)][1]
ymax = tail(y[which(y <= quantile(y,0.75) + 1.5*iqr)],1)
lower = quantile(y,0.25)
upper = quantile(y,0.75)
middle = quantile(y,0.5)

# Overlay the hand-computed statistics on the native boxplot to check that they match
ggplot(data, aes(y=x)) + 
  geom_boxplot() +
  geom_hline(yintercept=c(ymin, lower, middle, upper, ymax),
             color='red', linetype='dashed')
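The notch edges from the list above are not computed in the snippet. A minimal sketch using the same formula follows; notch_edges is a hypothetical helper name, and it assumes the IQR comes from quantile() with its default type, as in the calculation above.

```r
# Hypothetical helper: notch edges as listed in ?geom_boxplot,
# median -/+ 1.58 * IQR / sqrt(n), with the IQR taken from quantile().
notch_edges <- function(x) {
  x <- x[!is.na(x)]                              # drop missing values before counting n
  q <- unname(quantile(x, c(0.25, 0.5, 0.75)))   # lower hinge, median, upper hinge
  half_width <- 1.58 * (q[3] - q[1]) / sqrt(length(x))
  c(notchlower = q[2] - half_width, notchupper = q[2] + half_width)
}

# Small illustrative vector (not the data from the question)
notch_edges(c(1, 2, 3, 4, 5, 6, 7, 8, 9))
```

notch_edges(data$x) gives the edges for the data in the question; passing notch=TRUE to geom_boxplot() draws them on the plot.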

We can also extract these statistics directly from a ggplot object using ggplot_build():

p <- ggplot(data, aes(y=x)) + geom_boxplot() 
# $data is a list with one data frame per layer; take the first five columns of layer 1
ggplot_build(p)$data[[1]][1:5]

#   ymin lower middle upper  ymax 
# 1  0.2  42.5  93.05   122 232.2 
dww answered Sep 21 '22 09:09