I would like to use ggplot2 to illustrate the difference between two similar density distributions. Here is a toy example of the type of data I have:
library(ggplot2)
# Make toy data
n_sp <- 100000
n_dup <- 50000
D <- data.frame(
event=c(rep("sp", n_sp), rep("dup", n_dup) ),
q=c(rnorm(n_sp, mean=2.0), rnorm(n_dup, mean=2.1))
)
# Standard density plot
# after_stat(density) replaces the deprecated ..density.. notation (ggplot2 >= 3.3.0)
ggplot(D, aes(x = q, y = after_stat(density), col = event)) +
  geom_freqpoly()
Rather than separately plot the density for each category (dup and sp) as above, how could I plot a single line that shows the difference between these distributions?
In the toy example above, if I subtracted the dup density distribution from the sp density distribution, the resulting line would be above zero on the left side of the plot (since there is an abundance of smaller sp values) and below zero on the right (since there is an abundance of larger dup values). Note that there may be a different number of observations of type dup and sp.
More generally - what is the best way to show differences between similar density distributions?
There may be a way to do this within ggplot, but frequently it's easiest to do the calculations beforehand. In this case, call density() on each subset of q over the same range, then subtract the y values. Using dplyr (translate to base R or data.table if you wish):
library(dplyr)
library(ggplot2)

D %>%
  group_by(event) %>%
  # calculate densities for each group over same range; store in list column
  summarise(d = list(density(q, from = min(.$q), to = max(.$q)))) %>%
  # make a new data.frame from the two density objects; groups are sorted
  # alphabetically, so d[[1]] is "dup" and d[[2]] is "sp"
  do(data.frame(x = .$d[[1]]$x,                   # grab one set of x values (which are the same)
                y = .$d[[2]]$y - .$d[[1]]$y)) %>%  # sp density minus dup density, as in the question
  ggplot(aes(x, y)) +    # now plot
  geom_line()
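Since the answer mentions translating to base R, here is a minimal sketch of that route. It is one possible reading, not the answerer's own code. The fixed seed, the grid size n = 512, the names rng, d_sp, d_dup and density_diff, and the dashed zero line are all assumptions added for illustration. Each kernel density estimate integrates to roughly 1, so the unequal group sizes in the toy data should not bias the difference.

```r
library(ggplot2)

set.seed(1)  # assumption: fixed seed so the sketch is reproducible

# Toy data from the question
n_sp <- 100000
n_dup <- 50000
D <- data.frame(
  event = c(rep("sp", n_sp), rep("dup", n_dup)),
  q = c(rnorm(n_sp, mean = 2.0), rnorm(n_dup, mean = 2.1))
)

# Evaluate both kernel density estimates on one shared grid of x values
rng <- range(D$q)
d_sp  <- density(D$q[D$event == "sp"],  from = rng[1], to = rng[2], n = 512)
d_dup <- density(D$q[D$event == "dup"], from = rng[1], to = rng[2], n = 512)

# sp minus dup: positive where sp values are over-represented
density_diff <- data.frame(x = d_sp$x, y = d_sp$y - d_dup$y)

p <- ggplot(density_diff, aes(x, y)) +
  geom_hline(yintercept = 0, linetype = "dashed") +  # reference line at zero
  geom_line()
print(p)
```

Selecting the groups by name rather than by position makes the direction of the subtraction explicit. Note that density() picks a bandwidth separately for each group by default, so part of the difference curve can reflect the two bandwidths rather than the data; passing the same bw to both calls is one way to rule that out.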