
ggplot2, geom_boxplot with custom quantiles and outliers

Tags: r, ggplot2

I have a dataset from 100 simulations of train runs in a network with 4 trains and 6 stations. It records the lateness at arrival for each train at each station. My data looks something like this:

# 4 trains x 100 simulations x 6 stations = 2400 rows
MyData <- data.frame(
  Simulation = rep(sort(rep(1:100, 6)), 4),
  Train_number = sort(rep(c(100, 102, 104, 106), 100*6)),
  Stations = rep(c("ST_1", "ST_2", "ST_3", "ST_4", "ST_5", "ST_6"), 100*4),
  # Only 400 values are supplied here; data.frame() recycles them 6 times
  Arrival_Lateness = c(rep(0, 60), rexp(40, 1), rep(0, 60), rexp(40, 2), rep(0, 60), rexp(40, 3), rep(0, 60), rexp(40, 5))
)

I now create boxplots for each train and station with custom quantiles (thanks to jlhoward):

library(ggplot2)

# Summary function for stat_summary(): box at the quartiles,
# whiskers at the 5th and 95th percentiles
f <- function(x) {
  r <- quantile(x, probs = c(0.05, 0.25, 0.5, 0.75, 0.95))
  names(r) <- c("ymin", "lower", "middle", "upper", "ymax")
  r
}

ggplot(MyData, aes(factor(Stations), Arrival_Lateness, fill = factor(Train_number))) + 
  stat_summary(fun.data = f, geom="boxplot", position="dodge")

Very pretty: [image: resulting boxplot chart]
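As a quick sanity check of what f passes to the boxplot geom, here is a minimal sketch that calls it on a made-up vector (the integers 0 to 100, which are not part of MyData). The five names are the ones the boxplot geom reads for the whiskers, box edges and median line.

```r
# Same summary function as above
f <- function(x) {
  r <- quantile(x, probs = c(0.05, 0.25, 0.5, 0.75, 0.95))
  names(r) <- c("ymin", "lower", "middle", "upper", "ymax")
  r
}

# Illustrative input only (an assumption, not taken from MyData)
x <- 0:100
res <- f(x)
print(res)
# ymin = 5, lower = 25, middle = 50, upper = 75, ymax = 95
```

With this input the whiskers end at 5 and 95, so anything below 5 or above 95 is left undrawn unless a separate layer adds it.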

What I am missing now is outliers. I would like to plot the top 5% of observations for each train/station combination on top of each boxplot. What I tried is this (inspired by this question):

# Return the observations strictly above the 95th percentile of x
q <- function(x) {
  subset(x, quantile(x, 0.95) < x)
}

ggplot(MyData, aes(factor(Stations), Arrival_Lateness, fill = factor(Train_number))) + 
  stat_summary(fun.data = f, geom="boxplot", position="dodge") + 
  stat_summary(fun.y = q, geom="point", position="dodge")

I get a message: "ymax not defined: adjusting position using y instead" and my chart looks like this:

[image: chart produced by the attempt above]

which is clearly not what I wanted.
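For reference, a small sketch of what q returns on its own, using made-up vectors rather than MyData. It suggests the selection logic itself is fine, so the problem is more likely in how the points are positioned than in which points are chosen.

```r
# Same helper as in the question
q <- function(x) {
  subset(x, quantile(x, 0.95) < x)
}

# Illustrative input only: the 95th percentile of 0:100 is 95
x <- 0:100
top <- q(x)
print(top)   # 96 97 98 99 100

# A group with no variation has nothing strictly above its 95th percentile
none <- q(rep(0, 10))
print(length(none))   # 0
```

Because the comparison is strict, a train/station group whose values are all equal (for example all zeros) contributes no points at all.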

Ratamahatta asked Feb 25 '14 19:02


1 Answer

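Is this what you are after? Both layers use the same explicit dodge width, position_dodge(1), and the points are colored by train: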

# Note: in ggplot2 3.3.0 and later, fun.y is deprecated in favor of fun
ggplot(MyData, aes(factor(Stations), Arrival_Lateness, 
                   fill = factor(Train_number))) + 
  stat_summary(fun.data = f, geom="boxplot", 
               position=position_dodge(1))+
  stat_summary(aes(color=factor(Train_number)),fun.y = q, geom="point", 
               position=position_dodge(1))

IMHO, faceting by station is a little easier to interpret:

ggplot(MyData, aes(factor(Train_number), Arrival_Lateness, 
               fill = factor(Train_number))) + 
  stat_summary(fun.data = f, geom="boxplot",
               position=position_dodge(1))+
  stat_summary(aes(color=factor(Train_number)),fun.y = q, geom="point", 
               position=position_dodge(1))+
  facet_grid(.~Stations, scales="free")+
  theme(axis.text.x=element_text(angle=-90,hjust=1,vjust=0.2))+
  labs(x="Train Number")

EDIT (Response to OP's comment)

ggplot(MyData, aes(factor(Train_number), Arrival_Lateness, 
                   fill = factor(Train_number))) + 
  stat_summary(fun.data = f, geom="boxplot",
               position=position_dodge(1))+
  stat_summary(aes(color=factor(Train_number)),fun.y = q, geom="point", 
               position=position_dodge(1))+
  facet_grid(.~Stations, scales="free")+
  theme(axis.text.x=element_blank(), axis.ticks.x=element_blank())+
  scale_fill_discrete("Train")+scale_color_discrete("Train")+
  labs(x="")

To turn off x-axis text and tick marks, use theme(...=element_blank()). To turn off the axis label, use labs(x=""). Also, the fill and color scales must have the same name, or their legends are displayed separately.
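A possible variation, sketched here under assumptions rather than taken from the answer above: compute the outlier rows up front and draw them with geom_point, so the plot does not depend on stat_summary accepting a function that returns several values per group. The toy data frame and the names toy, cutoff and outliers are made up for illustration; f is the same summary function as in the question.

```r
library(ggplot2)

set.seed(42)  # illustrative data only, not the asker's MyData
toy <- data.frame(
  Station  = rep(c("ST_1", "ST_2"), each = 200),
  Train    = factor(rep(rep(c(100, 102), each = 100), times = 2)),
  Lateness = rexp(400, rate = 1)
)

# Per-group 95th percentile, repeated for every row of its group
cutoff <- ave(toy$Lateness, toy$Station, toy$Train,
              FUN = function(x) quantile(x, 0.95))

# Keep only the rows strictly above their group's cutoff
outliers <- toy[toy$Lateness > cutoff, ]

# Same summary function as in the question
f <- function(x) {
  r <- quantile(x, probs = c(0.05, 0.25, 0.5, 0.75, 0.95))
  names(r) <- c("ymin", "lower", "middle", "upper", "ymax")
  r
}

# Boxes from the full data, points from the precomputed outlier rows;
# both layers share the same dodge width so they should line up
p <- ggplot(toy, aes(Station, Lateness, fill = Train)) +
  stat_summary(fun.data = f, geom = "boxplot",
               position = position_dodge(1)) +
  geom_point(data = outliers, aes(colour = Train),
             position = position_dodge(1))
print(p)
```

With 100 continuous values per group, each train/station combination keeps its 5 largest observations as points. If some combination had no outlier rows, the dodging of the point layer could drift out of line with the boxes, so it is worth checking the result visually.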

jlhoward answered Sep 29 '22 08:09