The R boxplot function is a very useful way to look at data: it quickly gives you a visual summary of the approximate location and spread of your data, and of the number of outliers. In addition, I'd like to identify the outliers, in order to quickly find problems in the dataset.
The values of these outliers can be accessed using myplot$out. Unfortunately, the labels of these outliers seem to be unavailable. There are some packages aimed at displaying the labels on the plot itself (http://www.r-statistics.com/2011/01/how-to-label-all-the-outliers-in-a-boxplot/), but they don't work well, and I just want to list the outliers; I don't need them on the plot itself.
Any ideas?
You've done most of the hard work yourself. All that remains is a comparison:
##First create some data
##(You should include this in your question)
set.seed(2)
dd = data.frame(x = rlnorm(26), y=LETTERS)
Grab the outliers
outliers = boxplot(dd$x, plot=FALSE)$out
Extract the outliers from the original data frame
dd[dd$x %in% outliers,]
Further explanation:
The variable dd$x is the vector of 26 numbers. The variable outliers contains the values of the outliers (just type dd$x and outliers in your R console to see them). The command dd$x %in% outliers tests, for each value of dd$x, whether it appears in outliers, viz:
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE <snip>
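To see what %in% does in isolation, here is a minimal sketch on a small made-up vector (the names values and flagged are hypothetical and not part of the answer above). It returns one TRUE or FALSE per element of the left-hand side:

```r
# Hypothetical toy data, only to illustrate %in%
values  <- c(2.5, 40, 3.1, 2.5, 55)
flagged <- c(40, 55)

# One logical per element of `values`: TRUE where the value occurs in `flagged`
mask <- values %in% flagged
print(mask)  # FALSE  TRUE FALSE FALSE  TRUE
```

Note that %in% compares values exactly. That is fine here because the values in $out are taken straight from the data rather than recomputed.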
The square bracket notation, dd[dd$x %in% outliers,], returns the rows of the data frame dd where dd$x %in% outliers returns TRUE.
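Since the question asks only for a list of the outlier labels, the same logical vector can index the label column directly. A minimal sketch, reusing the example data from this answer and assuming the labels live in a column called y as they do there (is_out and outlier_labels are names introduced for illustration):

```r
# Same example data as in the answer
set.seed(2)
dd <- data.frame(x = rlnorm(26), y = LETTERS)

# Values flagged as outliers by boxplot(); nothing is drawn
outliers <- boxplot(dd$x, plot = FALSE)$out

# Logical mask marking the rows whose x value is an outlier
is_out <- dd$x %in% outliers

# Only the labels of those rows; as.character() covers the case
# where y was stored as a factor (the default before R 4.0)
outlier_labels <- as.character(dd$y[is_out])
print(outlier_labels)
```

If several rows share an outlying value, all of them are listed, which is usually what you want when hunting for problems in a dataset.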
I suggest which(x < myplot$stats[1] | x > myplot$stats[5])
where x is your data.
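As a sketch of how this suggestion might look in full, assuming myplot is the object returned by boxplot() and reusing the example data from the other answer (idx is a name introduced for illustration): stats[1] and stats[5] hold the ends of the lower and upper whiskers, so anything beyond them is an outlier.

```r
# Example data borrowed from the other answer
set.seed(2)
dd <- data.frame(x = rlnorm(26), y = LETTERS)
x <- dd$x

# Compute the boxplot statistics without drawing the plot
myplot <- boxplot(x, plot = FALSE)

# Row indices of points below the lower whisker or above the upper whisker
idx <- which(x < myplot$stats[1] | x > myplot$stats[5])

# The corresponding rows, labels included
print(dd[idx, ])
```

Unlike the %in% approach, this gives row positions rather than a logical mask, which is handy if you want to look the rows up in another object of the same length.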