
Overlay raw data onto geom_bar

Tags: r, ggplot2

I have a data-frame arranged as follows:

condition,treatment,value
A        ,  one    , 2
A        ,  one    , 1
A        ,  two    , 4
A        ,  two    , 2
...
D        ,  two    , 3

I have used ggplot2 to make a grouped bar plot that looks like this: example plot

The bars are grouped by "condition" and the colours indicate "treatment." The bar heights are the mean of the values for each condition/treatment pair. I achieved this by creating a new data frame containing the mean and standard error (for the error bars) for all the points that will make up each group.
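The aggregation step described above can be sketched roughly as follows. This is only an illustration: the data frame `raw`, the helper `se()` and the values for condition B are hypothetical, and the standard error is taken to be sd / sqrt(n).

```r
# Hypothetical raw data in the layout shown above
# (the rows for condition "B" are made up for illustration)
raw <- data.frame(
  condition = rep(c("A", "B"), each = 4),
  treatment = rep(rep(c("one", "two"), each = 2), times = 2),
  value     = c(2, 1, 4, 2, 3, 5, 1, 3)
)

# Hypothetical helper: standard error of the mean, sd / sqrt(n)
se <- function(x) sd(x) / sqrt(length(x))

# One row per condition/treatment pair, holding the bar height (mean) ...
summ <- aggregate(value ~ condition + treatment, data = raw, FUN = mean)
names(summ)[names(summ) == "value"] <- "mean"

# ... and the error-bar half-width (standard error); same formula and data,
# so the rows come back in the same order as in summ
summ$se <- aggregate(value ~ condition + treatment, data = raw, FUN = se)$value
```

The summary frame `summ` then feeds the bars and error bars, while `raw` is kept for the overlaid points.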

What I would like to do is superimpose the raw jittered data to produce a bar-chart version of this box plot: http://docs.ggplot2.org/0.9.3.1/geom_boxplot-6.png [I realise that a box plot would probably be better, but my hands are tied because the client is pathologically attached to bar charts]

I have tried adding a geom_point object to my plot and feeding it the raw data (rather than the aggregated means which were used to make the bars). This sort of works, but it plots the raw values at the wrong x axis locations. They appear at the points at which the red and grey bars join, rather than at the centres of the appropriate bar. So my plot looks like this:

wrong-looking plot

I cannot figure out how to shift the points by a fixed amount and then jitter them so that they are centred over the correct bar. Does anyone know how? Is there, perhaps, a better way of achieving what I'm trying to do?

What follows is a minimal example that shows the problem I have:

#Make some fake data
ex=data.frame(cond=rep(c('a','b','c','d'),each=8),
    treat=rep(rep(c('one','two'),4),each=4),
    value=rnorm(32) + rep(c(3,1,4,2),each=4) )

#Calculate the mean and SD of each condition/treatment pair
agg=aggregate(value~cond*treat, data=ex, FUN="mean") #mean
agg$sd=aggregate(value~cond*treat, data=ex, FUN="sd")$value #add the SD 


dodge <- position_dodge(width=0.9) 
limits <- aes(ymax=value+sd, ymin=value-sd) #Set up the error bars

p <- ggplot(agg, aes(fill=treat, y=value, x=cond)) 

#Plot, attempting to overlay the raw data
print(
       p + geom_bar(position=dodge, stat="identity") +
       geom_errorbar(limits, position=dodge, width=0.25) + 
       geom_point(data= ex[ex$treat=='one',], colour="green", size=3) +
       geom_point(data= ex[ex$treat=='two',], colour="pink", size=3)
)
RAAC asked May 24 '13 05:05


3 Answers

I found that it is unnecessary to create separate data frames. The plot can be built by giving ggplot the raw data and letting it compute the summaries.

ex <- data.frame(cond=rep(c('a','b','c','d'),each=8),
              treat=rep(rep(c('one','two'),4),each=4),
              value=rnorm(32) + rep(c(3,1,4,2),each=4) )

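p <- ggplot(ex, aes(cond, value, fill = treat))
# ggplot2 >= 3.3.0 names this argument `fun`; older versions call it `fun.y`
p + geom_bar(position = 'dodge', stat = 'summary', fun = 'mean') +
  # with no function supplied, stat = 'summary' falls back to mean_se()
  geom_errorbar(stat = 'summary', position = 'dodge', width = 0.9) +
  geom_point(aes(x = cond), shape = 21, position = position_dodge(width = 1))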

ggplot example

holmrenser answered Nov 14 '22 21:11


You need just one call to geom_point(), using the data frame ex with x mapped to cond, y mapped to value and color=treat (inside aes()). Then add position=dodge so that the points are dodged in the same way as the bars. With scale_color_manual() and its values= argument you can set the colours you need.

    p+geom_bar(position=dodge, stat="identity") +
      geom_errorbar(limits, position=dodge, width=0.25)+
      geom_point(data=ex,aes(cond,value,color=treat),position=dodge)+
      scale_color_manual(values=c("green","pink"))


UPDATE - jittering of points

You can't directly combine the dodge and jitter positions, but there are workarounds. If you save the whole plot as an object, ggplot_build() shows the x positions of the bars; in this case they are 0.775, 1.225, 1.775... These positions correspond to the combinations of the factors cond and treat. Because the data frame ex holds 4 values for each combination, add a new column that contains each of those x positions repeated 4 times.

ex$xcord<-rep(c(0.775,1.225,1.775,2.225,2.775,3.225,3.775,4.225),each=4)
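The positions do not have to be typed in by hand. Here is a minimal sketch of reading them from the built plot, assuming the same fake data as in the question. The seed and the names `bars` and `bar_x` are arbitrary choices for illustration. `layer_data()` is used here as a shortcut to the per-layer data that `ggplot_build()` produces.

```r
library(ggplot2)

set.seed(1)  # arbitrary seed so the fake data are reproducible
ex <- data.frame(cond  = rep(c("a", "b", "c", "d"), each = 8),
                 treat = rep(rep(c("one", "two"), 4), each = 4),
                 value = rnorm(32) + rep(c(3, 1, 4, 2), each = 4))
agg <- aggregate(value ~ cond * treat, data = ex, FUN = mean)

dodge <- position_dodge(width = 0.9)
bars <- ggplot(agg, aes(x = cond, y = value, fill = treat)) +
  geom_bar(position = dodge, stat = "identity")

# Layer 1 is the bar layer; after dodging, its x column holds the bar centres
bar_x <- sort(unique(as.numeric(layer_data(bars, 1)$x)))

# Rows of ex are ordered a/one, a/two, b/one, ... with 4 rows per combination,
# so each centre is repeated 4 times
ex$xcord <- rep(bar_x, each = 4)
```

This relies on the rows of ex being sorted by cond and then treat; with a different row order the centres would have to be matched to the rows by a merge instead.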

Now in geom_point() use this new column as x values and set position to jitter.

p+geom_bar(position=dodge, stat="identity") +
  geom_errorbar(limits, position=dodge, width=0.25)+
  geom_point(data=ex,aes(xcord,value,color=treat),position=position_jitter(width =.15))+
  scale_color_manual(values=c("green","pink"))


Didzis Elferts answered Nov 14 '22 22:11


As holmrenser illustrated above, referencing a single data frame and setting stat to "summary" in geom_bar is more efficient than creating additional data frames and keeping stat as "identity".

To both jitter and dodge the data points, as the OP originally asked, set the position argument of geom_point to position_jitterdodge(). This positioning scheme lets the jitter and dodge widths be customized independently, as follows:

p <- ggplot(ex, aes(cond, value, fill = treat))
# ggplot2 >= 3.3.0 names this argument `fun`; older versions call it `fun.y`
p + geom_bar(position = 'dodge', stat = 'summary', fun = 'mean') +
  geom_errorbar(stat = 'summary', position = 'dodge', width = 0.9) +
  geom_point(aes(x = cond), shape = 21,
             position = position_jitterdodge(jitter.width = 0.5,
                                             jitter.height = 0.4,
                                             dodge.width = 0.9))
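One variation that may be worth considering is sketched below, assuming the same fake data as above. Setting jitter.height = 0 keeps the plotted y values equal to the raw values, and the seed argument of position_jitterdodge() makes the horizontal jitter repeatable between renders. The seed values and the names `pos` and `pts` are arbitrary choices for illustration.

```r
library(ggplot2)

set.seed(1)  # arbitrary seed so the fake data are reproducible
ex <- data.frame(cond  = rep(c("a", "b", "c", "d"), each = 8),
                 treat = rep(rep(c("one", "two"), 4), each = 4),
                 value = rnorm(32) + rep(c(3, 1, 4, 2), each = 4))

# Horizontal jitter only; the fixed seed gives the same jitter on every build
pos <- position_jitterdodge(jitter.width = 0.5, jitter.height = 0,
                            dodge.width = 0.9, seed = 42)

pts <- ggplot(ex, aes(cond, value, fill = treat)) +
  geom_point(shape = 21, position = pos)

# Inspect what will actually be drawn for the point layer
built_first  <- layer_data(pts, 1)
built_second <- layer_data(pts, 1)
```

The same `pos` object can be passed to the geom_point() layer of the bar plot above in place of the inline position_jitterdodge() call.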


Paul Sochacki answered Nov 14 '22 23:11