Boxplot outlier labeling in R

Tags: r

I want to draw boxplots in R and add names to outliers. So far I found this solution.

The function there provides all the functionality I need, but it scrambles the labels. In the following example, it marks the outlier as "u" instead of "o":

library(plyr)
library(TeachingDemos)
source("http://www.r-statistics.com/wp-content/uploads/2011/01/boxplot-with-outlier-label-r.txt") # Load the function
set.seed(1500)
y <- rnorm(20)
x1 <- sample(letters[1:2], 20, replace = TRUE)
lab_y <- sample(letters, 20)
# plot a boxplot with interactions:
boxplot.with.outlier.label(y~x1, lab_y)

Do you know of any solution? The ggplot2 library is super nice, but as far as I know it provides no such functionality. My alternative is to use the text() function and extract the outlier information from the boxplot object, but then the labels may overlap.

Thanks a lot :-)

Federico Giorgi asked Oct 28 '11 12:10


2 Answers

I took a look at this with debug(boxplot.with.outlier.label), and ... it turns out there's a bug in the function.

The error occurs on line 125, where the data.frame DATA is constructed from x, y and label_name.

By that point x and y have already been reordered, while lab_y hasn't been. So when the supplied value of x (your x1) isn't itself already in order, you get the kind of jumbling you experienced.
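To make the mechanism concrete, here is a toy sketch of that kind of misalignment. It is not the code of boxplot.with.outlier.label; the names x, y, lab, wrong and right are made up for illustration.

```r
# Toy illustration only: lab[i] belongs to the point (x[i], y[i])
x   <- c("b", "a", "b", "a")
y   <- c(10, 1, 20, 2)
lab <- c("p", "q", "r", "s")

ord   <- order(x)      # reorder by group, as a function might do internally
x_ord <- x[ord]
y_ord <- y[ord]

# Labels left in their original order: rows no longer belong together
wrong <- data.frame(x = x_ord, y = y_ord, label = lab,
                    stringsAsFactors = FALSE)

# Labels reordered with the same index: rows line up again
right <- data.frame(x = x_ord, y = y_ord, label = lab[ord],
                    stringsAsFactors = FALSE)
```

In `wrong`, the point with y = 10 carries the label "r" although it was originally labeled "p"; in `right` it keeps "p".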

As an immediate fix, you can pre-order the x values like this (or do something more elegant):

df <- data.frame(y, x1, lab_y, stringsAsFactors=FALSE)
df <- df[order(df$x1), ]
# Needed since lab_y is not searched for in data (though it probably should be)
lab_y <- df$lab_y  

boxplot.with.outlier.label(y~x1, lab_y, data=df)

[Figure: boxplot produced by the procedure described above]

Josh O'Brien answered Sep 30 '22 00:09


Intelligent placement of point labels is a separate issue, discussed here or here. There is no single ideal solution, so you just have to pick one of the approaches described there.

So you would overplot the normal boxplot with labels, as follows:

set.seed(1501)
y <- c(4, 0, 7, -5, rnorm(16))
x1 <- c("a", "a", "b", "b", sample(letters[1:2], 16, replace = TRUE))
lab_y <- sample(letters, 20)

bx <- boxplot(y~x1)

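out_lab <- c()
# seq_along() is safe even when bx$out has zero or one element
for (i in seq_along(bx$out)) {
    # take the label of the first point whose value equals this outlier
    out_lab[i] <- lab_y[which(y == bx$out[i])[1]]
}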

identify(bx$group, bx$out, labels = out_lab, cex = 0.7)

Then, while identify() is running, you just click where you want each label to be placed, as described here. When finished, you just press "STOP". Note that an outlier value can match more than one label! In my solution, I simply picked the first.

PS: I feel ashamed for the for loop, but don't know how to vectorize it - feel free to post improvement.
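One possible vectorized form, offered as a sketch: match() returns the position of the first match for each element, which is what the loop picks with which(...)[1]. The data setup is repeated so the snippet runs on its own, and plot = FALSE is used only to get the statistics without drawing.

```r
set.seed(1501)
y <- c(4, 0, 7, -5, rnorm(16))
x1 <- c("a", "a", "b", "b", sample(letters[1:2], 16, replace = TRUE))
lab_y <- sample(letters, 20)

# Compute the boxplot statistics without drawing anything
bx <- boxplot(y ~ x1, plot = FALSE)

# For each outlier value, look up the first position in y with that value
# and take the label stored at that position
out_lab <- lab_y[match(bx$out, y)]
```

Like the loop, this matches on the value only and ignores the group, so identical values in different groups would still share the first label.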

EDIT: inspired by Federico's link, I now see it can be done much more easily! Just these 2 commands:

boxplot(y~x1)
identify(as.integer(as.factor(x1)), y, labels = lab_y, cex = 0.7)
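If clicking is not an option, a non-interactive variant along the lines of the text() idea from the question could look like the sketch below. It assumes the default whisker rule (1.5 times the interquartile range) and does nothing about overlapping labels; g and is_out are names made up for this example.

```r
set.seed(1501)
y <- c(4, 0, 7, -5, rnorm(16))
x1 <- c("a", "a", "b", "b", sample(letters[1:2], 16, replace = TRUE))
lab_y <- sample(letters, 20)

# Integer x position of each point's box
g <- as.integer(as.factor(x1))

# TRUE where a point is an outlier within its own group
# (ave() returns 0/1 here, hence the as.logical())
is_out <- as.logical(ave(y, g, FUN = function(v) v %in% boxplot.stats(v)$out))

boxplot(y ~ x1)
if (any(is_out)) {
    # pos = 4 draws each label to the right of its point
    text(g[is_out], y[is_out], labels = lab_y[is_out], pos = 4, cex = 0.7)
}
```

Because the outliers are found per group and by position, each label stays attached to its own point even if the same value occurs more than once.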
Tomas answered Sep 29 '22 22:09