
Calling predict() inside an R function

Tags:

r

I'd like to call predict(...) inside a function in R. I'm having some trouble related to scoping, but I can't quite figure out what's wrong or how to fix it. Can anyone help? Example:

df <- data.frame(x=1:20, binary.outcome=1*(runif(20, 0, 1) > 0.60))
summary(df)
logit.model <- glm(df$binary.outcome ~ df$x, family=binomial("logit"), data=df)
summary(logit.model)

PredictOnNewData <- function() {
  df <- data.frame(x=51:100)
  df$probability <- round(predict(logit.model, df, type="response"), digits=3)
  return(df)
}

PredictOnNewData()

The last line fails with:

Error in $<-.data.frame(*tmp*, "probability", value = c(0.274, 0.282, :
  replacement has 20 rows, data has 50
In addition: Warning message:
'newdata' had 50 rows but variable(s) found have 20 rows

If I'm understanding the error message correctly, it looks like the df object I'm passing to predict(...) is being evaluated as the df in the parent / global environment. That one has 20 rows and was used for training. But I want the call to predict(...) to be evaluated on the other df data frame -- the one I create inside the PredictOnNewData function. How can I make that happen (without changing the names of my data frames)?

[Now that I re-read this -- am I getting this backwards? In the line (df$probability <- ...), one of the dfs is being evaluated the wrong way, but which is it?]

I've also tried get("df", envir=sys.frame()), to be explicit about wanting the df object defined in the current function frame:

PredictOnNewData <- function() {
  df <- data.frame(x=51:100)
  # df$probability <- round(predict(logit.model, df, type="response"), digits=3)
  df$probability <- round(predict(logit.model, get("df", envir=sys.frame()), type="response"), digits=3)
  return(df)
}

PredictOnNewData()

...returns the same error as last time.

Please help!


It's definitely possible to call predict on a data frame larger than what was used as training data. An example (runs correctly):

df <- data.frame(x=1:20, binary.outcome=1*(runif(20, 0, 1) > 0.60))
summary(df)
logit.model <- glm(df$binary.outcome ~ df$x, family=binomial("logit"), data=df)
summary(logit.model)
df <- data.frame(x=1:100)
df$probability <- round(predict(logit.model, df, type="response"), digits=3)
df

That's exactly what I want to do -- except that I want the second df to be created by a function. How can I do that?

Adrian asked Feb 10 '11 00:02


1 Answer

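For predict to work properly, the formula and data arguments have to be used as intended: data is a data frame, and the formula is built from that data frame's column names and (formula) operators. A formula written as df$binary.outcome ~ df$x stores the expression df$x rather than the column name x, so predict cannot match it to a column of newdata. Instead it re-evaluates df$x in the environment where the formula was created (here the global one) and gets the 20 training rows back, which is why the df inside your function is ignored. I also don't like the implicit wild extrapolation outside the range of the development domain (trained on x from 1 to 20, predicting at 51 to 100), but we will ignore that for now. Try this minor modification: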

df <- data.frame(x=1:20, binary.outcome=1*(runif(20, 0, 1) > 0.60))
summary(df)
logit.model <- glm(binary.outcome ~ x, family=binomial("logit"), data=df)
summary(logit.model)

PredictOnNewData <- function() {
  df <- data.frame(x=51:100)
  df$probability <- round(predict(logit.model, newdata=df, type="response"), digits=3)
  return(df)
}

PredictOnNewData()
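
To see the difference between the two formulas directly, here is a minimal sketch that fits the model both ways and inspects what each one stores. The names train, bad.model, good.model and PredictOn are made up for illustration and are not from the question, and the seed is only there to make the run repeatable.

```r
set.seed(1)  # assumption: fixed seed so the sketch is repeatable
train <- data.frame(x = 1:20, binary.outcome = 1 * (runif(20, 0, 1) > 0.60))

# Formula written with train$...: the model stores the expression train$x.
bad.model <- glm(train$binary.outcome ~ train$x,
                 family = binomial("logit"), data = train)

# Formula written with bare column names: the model stores the name x.
good.model <- glm(binary.outcome ~ x,
                  family = binomial("logit"), data = train)

# The term labels show what predict() will try to find in newdata.
bad.terms  <- attr(terms(bad.model), "term.labels")   # "train$x"
good.terms <- attr(terms(good.model), "term.labels")  # "x"

# With the bad model, newdata has no usable column, so train$x is
# re-evaluated and the training rows come back (with a warning).
bad.pred <- suppressWarnings(
  predict(bad.model, newdata = data.frame(x = 51:100), type = "response"))
length(bad.pred)  # 20, not 50

# Hypothetical helper: takes the model and the new data as arguments,
# so it does not depend on any data frame name in an outer environment.
PredictOn <- function(model, newdata) {
  newdata$probability <- round(
    predict(model, newdata = newdata, type = "response"), digits = 3)
  newdata
}

result <- PredictOn(good.model, data.frame(x = 51:100))
nrow(result)  # 50
```

Passing the model and the new data in as arguments is a design choice rather than a requirement. With the bare-name formula, the version above that builds df inside the function works just as well.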
IRTFM answered Sep 28 '22 19:09