
In R ggplot2, include stat_ecdf() endpoints (0,0) and (1,1)

Tags: r, ggplot2, ecdf

I'm trying to use stat_ecdf() to plot cumulative successes as a function of a rank score created by a predictive model.

# libraries
library(ggplot2)
library(scales)

# fake data for reproducibility
set.seed(123)
n <- 200
df <- data.frame(model_score= rexp(n=n,rate=1:n),
                 obs_set= sample(c("training","validation"),n,replace=TRUE))
df$model_rank <- rank(df$model_score)/n
df$target_outcome <- rbinom(n,1,1-df$model_rank)

# Plot Gain Chart using stat_ecdf()
ggplot(subset(df,target_outcome==1),aes(x = model_rank)) + 
  stat_ecdf(aes(colour = obs_set), size=1) + 
  scale_x_continuous(limits=c(0,1), labels=percent,breaks=seq(0,1,.1)) +
  xlab("Model Percentile") + ylab("Percent of Target Outcome") +
  scale_y_continuous(limits=c(0,1), labels=percent) +
  geom_segment(aes(x=0,y=0,xend=1,yend=1), 
               colour = "gray", linetype="longdash", size=1) +
  ggtitle("Gain Chart")


All I want to do is force the ECDF to start at (0,0) and end at (1,1) so that there are no gaps at the beginning or end of the curve. If possible, I'd like to do it within the syntax of ggplot2, but I'd settle for a clever workaround.

@Henrik, this is NOT a duplicate of that question, because I have already defined my limits with scale_x_continuous() and scale_y_continuous(), and adding expand_limits() doesn't do anything. It is not the origin of the plot but the endpoints of stat_ecdf() that need to be fixed.

C8H10N4O2 asked Feb 19 '15 14:02


1 Answer

Unfortunately, the definition of stat_ecdf gives no wiggle room here; it determines the endpoints internally.

There is a somewhat advanced solution. With the latest development version of ggplot2 (devtools::install_github("hadley/ggplot2")), extensibility has improved to the point where this behavior can be overridden, though not without some boilerplate.

stat_ecdf2 <- function(mapping = NULL, data = NULL, geom = "step",
                      position = "identity", n = NULL, show.legend = NA,
                      inherit.aes = TRUE, minval=NULL, maxval=NULL,...) {
  layer(
    data = data,
    mapping = mapping,
    stat = StatEcdf2,
    geom = geom,
    position = position,
    show.legend = show.legend,
    inherit.aes = inherit.aes,
    stat_params = list(n = n, minval=minval,maxval=maxval),
    params = list(...)
  )
}


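# Subclass StatEcdf and post-process the data frame it returns.
StatEcdf2 <- ggproto("StatEcdf2", StatEcdf,
  calculate = function(data, scales, n = NULL, minval = NULL, maxval = NULL, ...) {
    # Let the stock stat compute the ECDF points first.
    df <- StatEcdf$calculate(data, scales, n, ...)
    # Move the first and last points of the curve to the requested x values.
    if (!is.null(minval)) { df$x[1] <- minval }
    if (!is.null(maxval)) { df$x[length(df$x)] <- maxval }
    df
  }
)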

Now, stat_ecdf2 will behave the same as stat_ecdf, but with optional minval and maxval parameters. So this will do the trick:

ggplot(subset(df,target_outcome==1),aes(x = model_rank)) +
  stat_ecdf2(aes(colour = obs_set), size=1, minval=0, maxval=1) +
  scale_x_continuous(limits=c(0,1), labels=percent,breaks=seq(0,1,.1)) +
  xlab("Model Percentile") + ylab("Percent of Target Outcome") +
  scale_y_continuous(limits=c(0,1), labels=percent) +
  geom_segment(aes(x=0,y=0,xend=1,yend=1),
               colour = "gray", linetype="longdash", size=1) +
  ggtitle("Gain Chart")
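
The code above targets the development API of the time (a calculate method and a stat_params argument to layer()). In released versions of ggplot2 (2.0.0 and later, as far as I know), the per-group method is named compute_group and layer() takes a single params list. Below is a minimal sketch under that assumption. It is written as a standalone stat rather than a subclass of StatEcdf, so it does not depend on StatEcdf's internal signature. The names StatEcdfPinned and stat_ecdf_pinned are hypothetical and not part of ggplot2, and the toy data frame is made up purely to exercise the stat.

```r
library(ggplot2)

# Hypothetical stat (not part of ggplot2): an ECDF whose curve is pinned to
# (minval, 0) on the left and (maxval, 1) on the right when those are supplied.
StatEcdfPinned <- ggproto("StatEcdfPinned", Stat,
  required_aes = "x",
  compute_group = function(data, scales, minval = NULL, maxval = NULL) {
    # Evaluate the empirical CDF at the distinct observed values.
    x <- sort(unique(data$x))
    y <- stats::ecdf(data$x)(x)
    # Add the requested endpoints instead of relying on default padding.
    if (!is.null(minval)) { x <- c(minval, x); y <- c(0, y) }
    if (!is.null(maxval)) { x <- c(x, maxval); y <- c(y, 1) }
    data.frame(x = x, y = y)
  }
)

# Hypothetical layer function wrapping the stat above.
stat_ecdf_pinned <- function(mapping = NULL, data = NULL, geom = "step",
                             position = "identity", ..., minval = NULL,
                             maxval = NULL, na.rm = FALSE, show.legend = NA,
                             inherit.aes = TRUE) {
  layer(
    data = data, mapping = mapping, stat = StatEcdfPinned, geom = geom,
    position = position, show.legend = show.legend, inherit.aes = inherit.aes,
    params = list(minval = minval, maxval = maxval, na.rm = na.rm, ...)
  )
}

# Small made-up data set, only to exercise the stat.
toy <- data.frame(
  model_rank = c(0.2, 0.4, 0.4, 0.9),
  obs_set = c("training", "training", "validation", "validation")
)
p <- ggplot(toy, aes(x = model_rank, colour = obs_set)) +
  stat_ecdf_pinned(minval = 0, maxval = 1)

# Inspect the points the stat produced for each group.
pinned <- layer_data(p)
```

In the gain chart above, stat_ecdf_pinned(aes(colour = obs_set), minval = 0, maxval = 1) would take the place of the stat_ecdf2 layer. Recent ggplot2 releases also give stat_ecdf() a pad argument, which may already cover the simpler cases.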

The big caveat here is that I don't know if the current extensibility model will be supported in the future; it has changed several times in the past, and the change to use "ggproto" is recent -- like July 15th 2015 recent.

As a plus, this gave me a chance to really dig into ggplot's internals, which is something that I've been meaning to do for a while.
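
If depending on ggplot2's extension mechanism is a concern, one workaround is to compute the ECDF points outside ggplot2 and add the endpoints by hand. The following is only a sketch: ecdf_with_endpoints is a hypothetical helper, not a ggplot2 or base R function, and it assumes the curve should run from (0,0) to (1,1) as in the question.

```r
# Hypothetical helper: ECDF points for one group, with (lo, 0) prepended
# and (hi, 1) appended so the curve has no gap at either end.
ecdf_with_endpoints <- function(x, lo = 0, hi = 1) {
  xs <- sort(unique(x))
  data.frame(x = c(lo, xs, hi),
             y = c(0, stats::ecdf(x)(xs), 1))
}

# Same fake data as in the question.
set.seed(123)
n <- 200
df <- data.frame(model_score = rexp(n = n, rate = 1:n),
                 obs_set = sample(c("training", "validation"), n, replace = TRUE))
df$model_rank <- rank(df$model_score) / n
df$target_outcome <- rbinom(n, 1, 1 - df$model_rank)

# Build one curve per obs_set from the successful outcomes only.
hits <- subset(df, target_outcome == 1)
curves <- do.call(rbind, lapply(split(hits, hits$obs_set), function(g) {
  pts <- ecdf_with_endpoints(g$model_rank)
  pts$obs_set <- g$obs_set[1]
  pts
}))
```

The resulting curves data frame can then be plotted with ggplot(curves, aes(x = x, y = y, colour = obs_set)) + geom_step() in place of the stat_ecdf() layer, keeping the scales, labels and reference segment from the original plot.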

user295691 answered Oct 17 '22 13:10