My question is very similar to Normalizing y-axis in histograms in R ggplot to proportion, except that I have two groups of data of different size, and I would like that each proportion is relative to its group size instead of the total size.
To make it clearer, let's say I have two sets of data in a data frame:
dataA<-rnorm(100,3,sd=2) dataB<-rnorm(400,5,sd=3) all<-data.frame(dataset=c(rep('A',length(dataA)),rep('B',length(dataB))),value=c(dataA,dataB))
I can plot the two distributions together with:
ggplot(all,aes(x=value,fill=dataset))+geom_histogram(alpha=0.5,position='identity',binwidth=0.5)
and instead of the frequency on the Y axis I can have the proportion with:
ggplot(all,aes(x=value,fill=dataset))+geom_histogram(aes(y=..count../sum(..count..)),alpha=0.5,position='identity',binwidth=0.5)
But this gives the proportion relative to the total data size (500 points here): is it possible to have it relative to each group size?
My goal here is to make it possible to compare visually the proportion of values in a given bin between A and B, independently from their respective size. Ideas which differ from my original one are also welcome!
Thanks!
In order to create a histogram by group in ggplot2 you will need to input the numerical and the categorical variable inside aes and use geom_histogram as follows. You can also set the categorical variable to the colour argument, so the border lines of each histogram will have a different color.
In order to overlay a kernel density estimate over a histogram in ggplot2 you will need to pass aes (y = ..density..) to geom_histogram and add geom_density as in the example below. The color, line width and line type of the kernel density curve can be customized making use of colour, lwd and linetype arguments.
Now you can wrap the y value in stat () rather than messing with .. stuff. ggplot (mydataframe, aes (x = value)) + geom_histogram (aes (y = stat (count / sum (count)))) @CephBirkSuppose I also specify a fill=column to the aesthetic.
identity position Setting position = "identity" is the most common use case, but recall to set a level of transparency with alpha so both histograms are completely visible. library(ggplot2) ggplot(df, aes(x = x, fill = group, colour = group)) + geom_histogram(alpha = 0.5, position = "identity")
Like this? [edited based on OP's comment]
ggplot(all,aes(x=value,fill=dataset))+ geom_histogram(aes(y=0.5*..density..), alpha=0.5,position='identity',binwidth=0.5)
Using y=..density..
scales the histograms so the area under each is 1, or sum(binwidth*y)=1.
As a result, you would use y = binwidth*..density..
to have y represent the fraction of the total in each bin. In your case, binwidth=0.5
.
IMO this is a little easier to interpret:
ggplot(all,aes(x=value,fill=dataset))+ geom_histogram(aes(y=0.5*..density..),binwidth=0.5)+ facet_wrap(~dataset,nrow=2)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With