We can remove outliers in R by setting the outlier. shape argument to NA. In addition, the coord_cartesian() function will be used to reject all outliers that exceed or below a given quartile. The y-axis of ggplot2 is not automatically adjusted.
There are no specific commands in SPSS to remove outliers from analysis or the Active DataSet, you fill first have to find out what observations are outliers and then remove them using case selection Select cases . Make sure to understand that you can select observations.
Use geom_boxplot(outlier.shape = NA)
to not display the outliers and scale_y_continuous(limits = c(lower, upper))
to change the axis limits.
An example.
n <- 1e4L
dfr <- data.frame(
y = exp(rlnorm(n)), #really right-skewed variable
f = gl(2, n / 2)
)
p <- ggplot(dfr, aes(f, y)) +
geom_boxplot()
p # big outlier causes quartiles to look too slim
p2 <- ggplot(dfr, aes(f, y)) +
geom_boxplot(outlier.shape = NA) +
scale_y_continuous(limits = quantile(dfr$y, c(0.1, 0.9)))
p2 # no outliers plotted, range shifted
Actually, as Ramnath showed in his answer (and Andrie too in the comments), it makes more sense to crop the scales after you calculate the statistic, via coord_cartesian
.
coord_cartesian(ylim = quantile(dfr$y, c(0.1, 0.9)))
(You'll probably still need to use scale_y_continuous
to fix the axis breaks.)
Here is a solution using boxplot.stats
# create a dummy data frame with outliers
df = data.frame(y = c(-100, rnorm(100), 100))
# create boxplot that includes outliers
p0 = ggplot(df, aes(y = y)) + geom_boxplot(aes(x = factor(1)))
# compute lower and upper whiskers
ylim1 = boxplot.stats(df$y)$stats[c(1, 5)]
# scale y limits based on ylim1
p1 = p0 + coord_cartesian(ylim = ylim1*1.05)
I had the same problem and precomputed the values for Q1, Q2, median, ymin, ymax using boxplot.stats
:
# Load package and generate data
library(ggplot2)
data <- rnorm(100)
# Compute boxplot statistics
stats <- boxplot.stats(data)$stats
df <- data.frame(x="label1", ymin=stats[1], lower=stats[2], middle=stats[3],
upper=stats[4], ymax=stats[5])
# Create plot
p <- ggplot(df, aes(x=x, lower=lower, upper=upper, middle=middle, ymin=ymin,
ymax=ymax)) +
geom_boxplot(stat="identity")
p
The result is a boxplot without outliers.
One idea would be to winsorize the data in a two-pass procedure:
run a first pass, learn what the bounds are, e.g. cut of at given percentile, or N standard deviation above the mean, or ...
in a second pass, set the values beyond the given bound to the value of that bound
I should stress that this is an old-fashioned method which ought to be dominated by more modern robust techniques but you still come across it a lot.
Ipaper::geom_boxplot2 is just what you want.
# devtools::install_github('kongdd/Ipaper')
library(Ipaper)
library(ggplot2)
p <- ggplot(mpg, aes(class, hwy))
p + geom_boxplot2(width = 0.8, width.errorbar = 0.5)
If you want to force the whiskers to extend to the max and min values, you can tweak the coef
argument. Default value for coef
is 1.5 (i.e. default length of the whiskers is 1.5 times the IQR).
# Load package and create a dummy data frame with outliers
#(using example from Ramnath's answer above)
library(ggplot2)
df = data.frame(y = c(-100, rnorm(100), 100))
# create boxplot that includes outliers
p0 = ggplot(df, aes(y = y)) + geom_boxplot(aes(x = factor(1)))
# create boxplot where whiskers extend to max and min values
p1 = ggplot(df, aes(y = y)) + geom_boxplot(aes(x = factor(1)), coef = 500)
Simple, dirty and effective. geom_boxplot(outlier.alpha = 0)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With