
Creating a ggplot2 histogram with a cumulative distribution curve (ECDF) on a different scale

Tags: r, ggplot2

Using ggplot2, I can create a histogram with a cumulative distribution curve with the following code. However, the stat_ecdf curve is scaled to the left y-axis.

library(ggplot2)
test.data <- data.frame(values = replicate(1, sample(0:10,1000, rep=TRUE)))
g <- ggplot(test.data, aes(x=values))
g + geom_bar() + 
    stat_ecdf() + 
    scale_y_continuous(sec.axis=sec_axis(trans = ~./100, name="percentage"))

In the resulting plot, the ECDF is squashed flat along the bottom, because its values (0 to 1) are drawn on the count axis.

How do I scale the stat_ecdf to the second y-axis?

zambonee asked Oct 20 '25 05:10


1 Answer

In general, multiply the ECDF value that stat_ecdf calculates internally (the cumulative proportion, available as ..y..) by the inverse of the secondary-axis transformation, so that the curve spans roughly the same vertical range as the bars:

library(ggplot2)
library(scales)  # for percent()

set.seed(2)
test.data <- data.frame(values = sample(0:10, 1000, replace = TRUE))

ggplot(test.data, aes(x = values)) +
  geom_bar(fill = "grey70") +
  # the ECDF runs from 0 to 1; multiply by 100 to draw it on the count axis
  stat_ecdf(aes(y = ..y.. * 100)) +
  # the secondary axis divides by 100 to recover the proportion
  scale_y_continuous(sec.axis = sec_axis(~ . / 100, name = "percentage", labels = percent)) +
  theme_bw()

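To check the scaling arithmetic without drawing anything, here is a minimal sketch in base R. It assumes the same simulated data as above; scale_factor and the other variable names are illustrative and not part of the original answer.

```r
# Sketch: why multiplying the ECDF by 100 puts it on the bar (count) scale.
set.seed(2)
values <- sample(0:10, 1000, replace = TRUE)

counts <- table(values)        # bar heights that geom_bar() would draw
ecdf_fun <- ecdf(values)       # base-R ECDF; returns proportions in [0, 1]
ecdf_vals <- ecdf_fun(0:10)

scale_factor <- 100            # assumed: inverse of the sec_axis() transform ~ . / 100
scaled <- ecdf_vals * scale_factor   # what is drawn against the left (count) axis
back <- scaled / scale_factor        # what the secondary axis reports

range(counts)   # bar heights, for comparison with the scaled curve
range(scaled)   # the scaled curve tops out at scale_factor
```

Because the ECDF always ends at 1, the scaled curve always ends at scale_factor, so choosing a factor near the tallest bar makes the curve fill the panel.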

Because you distributed 1,000 values randomly among 11 buckets, each bar is roughly 90 high (1000/11 ≈ 91), so a factor of 100 happened to line both y-scales up on round numbers. Below is a more general version.

It would also be nice to determine the transformation factor programmatically, rather than picking it by hand after seeing the bar heights in the plot. To do that, we calculate the height of the tallest bar outside ggplot and use that value (called max_y below) in the plot. We then use the pretty function to round max_y up to the highest "nice" break value covering the tallest bar. ggplot2's default axis breaks (from scales::extended_breaks()) usually fall on the same values as those from pretty, so the primary and secondary y-axis breaks will generally line up.
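The max_y calculation described above can be tried in isolation. This is a minimal base-R sketch; the names tallest and breaks are illustrative only.

```r
# Sketch: derive the scale factor from the data instead of picking it by eye.
set.seed(2)
values <- sample(0:10, 768, replace = TRUE)

tallest <- max(table(values))     # height of the tallest bar
breaks <- pretty(c(0, tallest))   # round break candidates covering 0..tallest
max_y <- max(breaks)              # a round value at or above the tallest bar

tallest
breaks
max_y
```

Since pretty() returns breaks that cover the supplied range, max_y is never smaller than the tallest bar, so the scaled ECDF reaches at least as high as the bars.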

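Finally, we use aes_ and bquote to create a quoted call in which the value of max_y has already been substituted, so that ggplot sees a number rather than a variable name when it evaluates the expression. (Note that aes_ has been soft-deprecated since ggplot2 3.0.0.)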

set.seed(2)
test.data <- data.frame(values = sample(0:10, 768, replace = TRUE))

# height of the tallest bar, rounded up to the top "pretty" axis break
max_y <- max(table(test.data$values))
max_y <- max(pretty(c(0, max_y)))

ggplot(test.data, aes(x = values)) +
  geom_bar(fill = "grey70") +
  # bquote() substitutes the value of max_y into the quoted expression
  stat_ecdf(aes_(y = bquote(..y.. * .(max_y)))) +
  scale_y_continuous(sec.axis = sec_axis(~ . / max_y, name = "percentage", labels = percent)) +
  theme_bw()

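The ..y.. notation is deprecated in ggplot2 3.4.0 and later, and aes_ is soft-deprecated. One way to avoid both is to compute the ECDF with base R's ecdf() and draw it with geom_step(). This is a different technique from the stat_ecdf approach above, offered as a sketch only; ecdf_df, xs and cum are hypothetical names introduced for illustration.

```r
library(ggplot2)

set.seed(2)
test.data <- data.frame(values = sample(0:10, 768, replace = TRUE))

# Scale factor: tallest bar rounded up to the top pretty() break.
max_y <- max(pretty(c(0, max(table(test.data$values)))))

# Precompute the ECDF at each distinct value (hypothetical helper data frame).
xs <- sort(unique(test.data$values))
ecdf_df <- data.frame(values = xs, cum = ecdf(test.data$values)(xs))

p <- ggplot(test.data, aes(x = values)) +
  geom_bar(fill = "grey70") +
  # Draw the curve on the count scale by multiplying by max_y ...
  geom_step(data = ecdf_df, aes(y = cum * max_y)) +
  # ... and undo that factor on the secondary axis.
  scale_y_continuous(
    sec.axis = sec_axis(~ . / max_y, name = "percentage",
                        labels = scales::percent)
  ) +
  theme_bw()

p
```

The transformation is passed to sec_axis() positionally because the name of that argument differs between ggplot2 releases. The step curve starts at the smallest observed value rather than extending to the panel edges as stat_ecdf does by default.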

eipi10 answered Oct 21 '25 19:10