Can somebody explain why I get a different number of outliers with the normal boxplot command than with geom_boxplot from ggplot2?
Here is an example:
x <- c(280.9, 135.9, 321.4, 333.7, 0.2, 71.3, 33.0, 102.6, 126.8, 194.8, 35.5,
107.3, 45.1, 107.2, 55.2, 28.1, 36.9, 24.3, 68.7, 163.5, 0.8, 31.8, 121.4,
84.7, 34.3, 25.2, 101.4, 203.2, 194.1, 27.9, 42.5, 47.0, 85.1, 90.4, 103.8,
45.1, 94.0, 36.0, 60.9, 97.1, 42.5, 96.4, 58.4, 174.0, 173.2, 164.1, 92.1,
41.9, 130.2, 94.7, 121.5, 261.4, 46.7, 16.3, 50.7, 112.9, 112.2, 242.5, 140.6,
112.6, 31.2, 36.7, 97.4, 140.5, 123.5, 42.9, 59.4, 94.5, 37.4, 232.2, 114.6,
60.7, 27.8, 115.5, 111.9, 60.1)
library(ggplot2)
data <- data.frame(x)
boxplot(data$x)
ggplot(data, aes(y=x)) + geom_boxplot()
With the boxplot command I get a plot with 4 outliers.
And with ggplot2 I get a plot with 5 outliers.
In ggplot2, an observation is drawn as an outlier if it meets one of the following two requirements: it lies more than 1.5 times the interquartile range below the first quartile (Q1), or it lies more than 1.5 times the interquartile range above the third quartile (Q3).
We can identify and label these outliers with the ggbetweenstats function in the ggstatsplot package. To label outliers, we set the outlier.tagging argument to TRUE and use the outlier.label argument to specify which variable labels each outlier.
An outlier is an observation that is numerically distant from the rest of the data. In a boxplot, an outlier is a data point located outside the fences (“whiskers”) of the boxplot, e.g. more than 1.5 times the interquartile range above the upper quartile or below the lower quartile.
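As a minimal sketch of that 1.5 times IQR rule in base R: the helper name iqr_outliers and the vector v are made up for illustration, and the quartiles come from quantile() with its default settings, which is an assumption about which quartile definition you want.

```r
# Hypothetical helper: return the values lying more than coef * IQR
# below the first quartile or above the third quartile.
iqr_outliers <- function(v, coef = 1.5) {
  q <- quantile(v, c(0.25, 0.75), names = FALSE)  # Q1 and Q3
  fence <- coef * (q[2] - q[1])                   # coef * IQR
  v[v < q[1] - fence | v > q[2] + fence]
}

v <- c(2, 3, 4, 5, 6, 7, 8, 50)  # made-up values for illustration
iqr_outliers(v)                  # only 50 lies beyond the fences
```

Here Q1 is 3.75 and Q3 is 7.25, so the fences sit at -1.5 and 12.5.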
stat_boxplot() provides computed variables, some of which depend on the orientation, including width (the width of the boxplot) and ymin or xmin.
ggplot and boxplot use slightly different methods to calculate the statistics. From ?geom_boxplot we can see:
The lower and upper hinges correspond to the first and third quartiles (the 25th and 75th percentiles). This differs slightly from the method used by the boxplot() function, and may be apparent with small samples. See boxplot.stats() for more information on how hinge positions are calculated for boxplot().
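To see where the two counts come from, it may help to compare the hinges and the quartiles on the data from the question. The sketch below uses only base R and assumes that geom_boxplot() takes its quartiles from quantile() with default settings, which is consistent with the values ggplot reports for this data. The variable names are illustrative.

```r
x <- c(280.9, 135.9, 321.4, 333.7, 0.2, 71.3, 33.0, 102.6, 126.8, 194.8, 35.5,
       107.3, 45.1, 107.2, 55.2, 28.1, 36.9, 24.3, 68.7, 163.5, 0.8, 31.8, 121.4,
       84.7, 34.3, 25.2, 101.4, 203.2, 194.1, 27.9, 42.5, 47.0, 85.1, 90.4, 103.8,
       45.1, 94.0, 36.0, 60.9, 97.1, 42.5, 96.4, 58.4, 174.0, 173.2, 164.1, 92.1,
       41.9, 130.2, 94.7, 121.5, 261.4, 46.7, 16.3, 50.7, 112.9, 112.2, 242.5, 140.6,
       112.6, 31.2, 36.7, 97.4, 140.5, 123.5, 42.9, 59.4, 94.5, 37.4, 232.2, 114.6,
       60.7, 27.8, 115.5, 111.9, 60.1)

# Hinges as used by boxplot(): boxplot.stats() takes them from fivenum()
hinges <- boxplot.stats(x)$stats[c(2, 4)]

# Quartiles from quantile() with its default type (assumed to match ggplot2)
quartiles <- quantile(x, c(0.25, 0.75), names = FALSE)

# Upper fences: upper hinge or quartile plus 1.5 times the respective spread
upper_fence_base <- hinges[2] + 1.5 * diff(hinges)
upper_fence_gg   <- quartiles[2] + 1.5 * diff(quartiles)

# Outliers under each rule
out_base <- boxplot.stats(x)$out
out_gg   <- x[x < quartiles[1] - 1.5 * diff(quartiles) | x > upper_fence_gg]

length(out_base)  # 4
length(out_gg)    # 5
```

On this data the upper hinge is 122.5 while the 75% quantile is 122, so the upper fence is 242.5 for boxplot() and 241.25 for the quantile-based rule. The observation 242.5 sits exactly on the first fence, so it is not an outlier there, but it lies beyond the second. That accounts for 4 versus 5 outliers.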
You can get ggplot to use boxplot.stats if you want the same results:
# Function to use boxplot.stats to set the box-and-whisker locations
f.bxp = function(x) {
bxp = boxplot.stats(x)[["stats"]]
names(bxp) = c("ymin","lower", "middle","upper","ymax")
bxp
}
# Function to use boxplot.stats for the outliers
f.out = function(x) {
data.frame(y=boxplot.stats(x)[["out"]])
}
To use those functions in ggplot:
ggplot(data, aes(0, y=x)) +
stat_summary(fun.data=f.bxp, geom="boxplot") +
stat_summary(fun.data=f.out, geom="point")
If you want to replicate the statistics that ggplot uses natively, these are explained in ?geom_boxplot as follows:
ymin = lower whisker = smallest observation greater than or equal to lower hinge - 1.5 * IQR
lower = lower hinge, 25% quantile
notchlower = lower edge of notch = median - 1.58 * IQR / sqrt(n)
middle = median, 50% quantile
notchupper = upper edge of notch = median + 1.58 * IQR / sqrt(n)
upper = upper hinge, 75% quantile
ymax = upper whisker = largest observation less than or equal to upper hinge + 1.5 * IQR
We can calculate these accordingly:
y = sort(x)
iqr = quantile(y,0.75) - quantile(y,0.25)
ymin = y[which(y >= quantile(y,0.25) - 1.5*iqr)][1]
ymax = tail(y[which(y <= quantile(y,0.75) + 1.5*iqr)],1)
lower = quantile(y,0.25)
upper = quantile(y,0.75)
middle = quantile(y,0.5)
# Overlay the five computed statistics on the native boxplot
ggplot(data, aes(y=x)) +
  geom_boxplot() +
  geom_hline(yintercept=c(ymin, lower, middle, upper, ymax), color='red', linetype='dashed')
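The list from ?geom_boxplot also mentions the notch edges, which the calculation above leaves out. A minimal base R sketch of those two values, following the formula quoted in the list (median ± 1.58 * IQR / sqrt(n)) and assuming the IQR is taken from quantile() with default settings:

```r
x <- c(280.9, 135.9, 321.4, 333.7, 0.2, 71.3, 33.0, 102.6, 126.8, 194.8, 35.5,
       107.3, 45.1, 107.2, 55.2, 28.1, 36.9, 24.3, 68.7, 163.5, 0.8, 31.8, 121.4,
       84.7, 34.3, 25.2, 101.4, 203.2, 194.1, 27.9, 42.5, 47.0, 85.1, 90.4, 103.8,
       45.1, 94.0, 36.0, 60.9, 97.1, 42.5, 96.4, 58.4, 174.0, 173.2, 164.1, 92.1,
       41.9, 130.2, 94.7, 121.5, 261.4, 46.7, 16.3, 50.7, 112.9, 112.2, 242.5, 140.6,
       112.6, 31.2, 36.7, 97.4, 140.5, 123.5, 42.9, 59.4, 94.5, 37.4, 232.2, 114.6,
       60.7, 27.8, 115.5, 111.9, 60.1)

n      <- length(x)
iqr    <- IQR(x)      # quantile(x, 0.75) - quantile(x, 0.25)
middle <- median(x)

# Notch edges as described in the list above
notchlower <- middle - 1.58 * iqr / sqrt(n)
notchupper <- middle + 1.58 * iqr / sqrt(n)

c(notchlower = notchlower, notchupper = notchupper)
```

The notches are only drawn when you ask for them, for example with geom_boxplot(notch = TRUE).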
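We can also extract these statistics directly from a ggplot object using ggplot_build:
p <- ggplot(data, aes(y=x)) + geom_boxplot()
# $data is a list with one data frame per layer; take the first layer
ggplot_build(p)$data[[1]][, c("ymin", "lower", "middle", "upper", "ymax")]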
# ymin lower middle upper ymax
# 1 0.2 42.5 93.05 122 232.2