Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

ggplot2: stat_smooth for logistic outcomes with facet_wrap returning 'full' or 'subset' glm models

Tags:

plot

r

ggplot2

I am working on a logistic regression model with one continuous predictor and one categorical predictor with several levels. I want to present the results using ggplot2 and exploiting the facet_wrap to show the regression lines for each level of the categorical predictor. When doing this I noticed the fitted curve provided by stat_smooth only considers the data in a particular facet, not the whole data set. This is a small difference, but a noticeable one when looking at the plot versus predicted values returned from predict.glm.

Here is an example recreating the issue with the graphic following the code.

library(boot)    # needed for inv.logit function
library(ggplot2) # version 0.8.9

set.seed(42)
n <- 100

df <- data.frame(location = rep(LETTERS[1:4], n),
                 score    = sample(45:80, 4*n, replace = TRUE))

df$p    <- inv.logit(0.075 * df$score + rep(c(-4.5, -5, -6, -2.8), n))
df$pass <- sapply(df$p, function(x){rbinom(1, 1, x)}) 

gplot <- ggplot(df, aes(x = score, y = pass)) + 
            geom_point() + 
            facet_wrap( ~ location) + 
            stat_smooth(method = 'glm', family = 'binomial') 

# 'full' logistic model
g <- glm(pass ~ location + score, data = df, family = 'binomial')
summary(g)

# new.data for predicting new observations
new.data <- expand.grid(score    = seq(46, 75, length = n), 
                        location = LETTERS[1:4])

new.data$pred.full <- predict(g, newdata = new.data, type = 'response')

pred.sub <- NULL
for(i in LETTERS[1:4]){
  pred.sub <- c(pred.sub,
    predict(update(g, formula = . ~ score, subset = location %in% i), 
            newdata = data.frame(score = seq(46, 75, length = n)), 
            type = 'response'))
}

new.data$pred.sub <- pred.sub

gplot + 
  geom_line(data = new.data, aes(x = score, y = pred.full), color = 'green') + 
  geom_line(data = new.data, aes(x = score, y = pred.sub),  color = 'red')

enter image description here

What I noted and am concerned about is ease to see in facet B. The red curves are the predicted values from models only considering one location, whereas the green curves are predictions using the full data set. The models based on the subset of the data match the plot from stat_smooth.

I would like to plot, with standard error shading, the green curves via ggplot2. I'm sure there is an option somewhere in the code I could use that would do this, but I have yet to find it, or perhaps there is a different order or steps I should follow to get the green curves from a ggplot call. I have found similar problems when plotting everything on one facet and using the color or group aesthetic.

Any suggestions would be greatly appreciated.

like image 566
Peter Avatar asked Dec 28 '11 22:12

Peter


1 Answers

You're correct that the way to do this is to fit the model outside of ggplot2 and then calculate the fitted values and intervals how you like and pass that data in separately.

One way to achieve what you describe would be something like this:

preds <- predict(g, newdata = new.data, type = 'response',se = TRUE)
new.data$pred.full <- preds$fit

new.data$ymin <- new.data$pred.full - 2*preds$se.fit
new.data$ymax <- new.data$pred.full + 2*preds$se.fit  

ggplot(df,aes(x = score, y = pass)) + 
    facet_wrap(~location) + 
    geom_point() + 
    geom_ribbon(data = new.data,aes(y = pred.full, ymin = ymin, ymax = ymax),alpha = 0.25) +
    geom_line(data = new.data,aes(y = pred.full),colour = "blue")

enter image description here

This comes with the usual warnings about intervals on fitted values: it's up to you to make sure that the interval you're plotting is what you really want. There tends to be a lot of confusion about "prediction intervals".

like image 125
joran Avatar answered Nov 06 '22 15:11

joran