 

Normalizing y-axis in histograms in R ggplot to proportion by group

My question is very similar to "Normalizing y-axis in histograms in R ggplot to proportion", except that I have two groups of data of different sizes, and I would like each proportion to be relative to its group's size rather than to the total size.

To make it clearer, let's say I have two sets of data in a data frame:

dataA <- rnorm(100, 3, sd = 2)
dataB <- rnorm(400, 5, sd = 3)
all <- data.frame(dataset = c(rep('A', length(dataA)), rep('B', length(dataB))),
                  value = c(dataA, dataB))

I can plot the two distributions together with:

ggplot(all,aes(x=value,fill=dataset))+geom_histogram(alpha=0.5,position='identity',binwidth=0.5) 

and instead of the frequency on the Y axis I can have the proportion with:

ggplot(all,aes(x=value,fill=dataset))+geom_histogram(aes(y=..count../sum(..count..)),alpha=0.5,position='identity',binwidth=0.5) 

But this gives the proportion relative to the total data size (500 points here): is it possible to have it relative to each group size?

My goal is to compare visually the proportion of values falling in a given bin between A and B, independently of their respective sizes. Ideas that differ from my original one are also welcome!

Thanks!

Asked by Erwan on Mar 04 '14 at 19:03




1 Answer

Like this? [edited based on OP's comment]

ggplot(all,aes(x=value,fill=dataset))+
  geom_histogram(aes(y=0.5*..density..),
                 alpha=0.5,position='identity',binwidth=0.5)

Using y=..density.. scales each histogram so that the area under it is 1, i.e. sum(binwidth*y)=1 within each group. To make y represent the fraction of each group's values that falls in each bin, use y = binwidth*..density.. instead. In your case binwidth=0.5, hence the factor of 0.5 above.
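The arithmetic can be checked numerically without ggplot2. The following is a minimal sketch using base R's hist(), which defines density the same way (count / (n * binwidth)); the seed, the break construction, and the variable names fracA, fracB, totalA and totalB are assumptions made for illustration, not part of the original answer.

```r
# Numeric check with base R only (no ggplot2 needed).
set.seed(42)  # assumption: fixed seed so the sketch is reproducible
dataA <- rnorm(100, 3, sd = 2)
dataB <- rnorm(400, 5, sd = 3)

binwidth <- 0.5
# Common breaks covering both groups, aligned to multiples of the bin width
rng <- range(c(dataA, dataB))
breaks <- seq(floor(rng[1] / binwidth) * binwidth,
              ceiling(rng[2] / binwidth) * binwidth + binwidth,
              by = binwidth)

hA <- hist(dataA, breaks = breaks, plot = FALSE)
hB <- hist(dataB, breaks = breaks, plot = FALSE)

# density is count / (n * binwidth), so binwidth * density is count / n
fracA <- binwidth * hA$density
fracB <- binwidth * hB$density

# Proportions relative to the combined total, as with count / sum(count)
n_total <- length(dataA) + length(dataB)
totalA <- hA$counts / n_total
totalB <- hB$counts / n_total

print(c(A = sum(fracA), B = sum(fracB)))    # per-group heights: each sums to 1
print(c(A = sum(totalA), B = sum(totalB)))  # total-relative heights: 0.2 and 0.8
```

With per-group scaling both sets of bars sum to 1, so their heights are directly comparable. With the total-relative scaling, group A's bars can only sum to 100/500 = 0.2, which is why A looks squashed next to B.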

IMO this is a little easier to interpret:

ggplot(all,aes(x=value,fill=dataset))+
  geom_histogram(aes(y=0.5*..density..),binwidth=0.5)+
  facet_wrap(~dataset,nrow=2)
Answered by jlhoward on Sep 19 '22 at 00:09
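On current ggplot2 the answer's ..density.. notation still works but is deprecated (since 3.4.0) in favour of after_stat(). A possible equivalent, sketched below, multiplies density by the width variable that stat_bin also computes, so the bin width does not need to be repeated as a hard-coded 0.5. The seed, the axis label, and the ggplot_build() check are assumptions added for illustration.

```r
library(ggplot2)

set.seed(1)  # assumption: fixed seed for reproducibility
dataA <- rnorm(100, 3, sd = 2)
dataB <- rnorm(400, 5, sd = 3)
all <- data.frame(
  dataset = c(rep("A", length(dataA)), rep("B", length(dataB))),
  value   = c(dataA, dataB)
)

# after_stat() replaces the ..name.. notation (ggplot2 >= 3.3.0).
# `density` and `width` are both computed per group by stat_bin.
p <- ggplot(all, aes(x = value, fill = dataset)) +
  geom_histogram(aes(y = after_stat(density * width)),
                 alpha = 0.5, position = "identity", binwidth = 0.5) +
  ylab("proportion of group")

# Inspect the computed bar heights: they should sum to 1 within each group
built <- ggplot_build(p)$data[[1]]
group_sums <- tapply(built$y, built$group, sum)
print(group_sums)
```

Printing p draws the overlaid histograms; adding facet_wrap(~dataset, nrow = 2) to p gives the faceted variant from the answer.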