Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

R - model.frame() and non-standard evaluation

I am puzzled at a behaviour of a function that I am trying to write. My example comes from the survival package but I think that the question is more general than that. Basically, the following code

library(survival)
data(bladder)  ## this will load "bladder", "bladder1" and "bladder2"

mod_init <- coxph(Surv(start, stop, event) ~ rx + number, data = bladder2, method = "breslow")
survfit(mod_init)

Will yield an object that I am interested in. However, when I write it in a function,

my_function <- function(formula, data) {
  mod_init <- coxph(formula = formula, data = data, method = "breslow")
  survfit(mod_init)
  }

my_function(Surv(start, stop, event) ~ rx + number, data = bladder2)

the function will return an error at the last line:

 Error in eval(predvars, data, env) : 
  invalid 'envir' argument of type 'closure' 
10 eval(predvars, data, env) 
9 model.frame.default(formula = Surv(start, stop, event) ~ rx + 
    number, data = data) 
8 stats::model.frame(formula = Surv(start, stop, event) ~ rx + 
    number, data = data) 
7 eval(expr, envir, enclos) 
6 eval(temp, environment(formula$terms), parent.frame()) 
5 model.frame.coxph(object) 
4 stats::model.frame(object) 
3 survfit.coxph(mod_init) 
2 survfit(mod_init) 
1 my_function(Surv(start, stop, event) ~ rx + number, data = bladder2) 

I am curious whether there is something obvious that I am missing or whether such behaviour is normal. I find it strange, since in the environment of my_function I would have the same objects as in the global environment when running the first portion of the code.

Edit: I also received useful input from Terry Therneau, the author of the survival package. This is his answer:

This is a problem that stems from the non-standard evaluation done by model.frame. The only way out of it that I have found is to add model.frame=TRUE to the original coxph call. I consider it a serious design flaw in R. Non-standard evaluation is like the dark side -- a tempting and easy path that always ends badly. Terry T.

like image 517
Theodor Avatar asked May 21 '16 15:05

Theodor


1 Answers

Diagnose

From the error message:

2 survfit(mod_init, newdata = base_case)
1 my_function(Surv(start, stop, event) ~ rx + number, data = bladder2) 

the problem is clearly not with coxph during model fitting, but with survfit.

And from this message:

10 eval(predvars, data, env) 
 9 model.frame.default(formula = Surv(start, stop, event) ~ rx + 
     number, data = data) 

I can tell that the problem is that during early stage of survfit, the function model.frame.default() can not find a model frame containing relevant data used in formula Surv(start, stop, event) ~ rx + number. Hence it complains.


What is a model frame?

A model frame, is formed from the data argument passed to fitting routine, like lm(), glm() and mgcv:::gam(). It is a data frame with the same number of rows as data, but:

  • dropping all variables not referenced by formula
  • adding many attributes, the most important of which is envrionement

Most model fitting routines, like lm(), glm(), and mgcv:::gam(), will keep the model frame in their fitted object by default. This has advantage that if we later call predict, and no newdata is provided, it will find data from this model frame for evaluation. However, a clear disadvantage is that it will substantially increase the size of your fitted object.

However, survival:::coxph() is an exception. It will by default not retain such model frame in their fitted object. Well, clearly, this makes the resulting fitted object much smaller in size, but, expose you to the problem you have encountered. If we want to ask survival:::coxph() to keep this model frame, then use model = TRUE of this function.


Test with survial:::coxph()

library(survival); data(bladder)

my_function <- function(myformula, mydata, keep.mf = TRUE) {
  fit <- coxph(myformula, mydata, method = "breslow", model = keep.mf)
  survfit(fit)
  }

Now, this function call will fail, as you have seen:

my_function(Surv(start, stop, event) ~ rx + number, bladder2, keep.mf = FALSE)

but this function call will succeed:

my_function(Surv(start, stop, event) ~ rx + number, bladder2, keep.mf = TRUE)

Same behaviour for lm()

We can actually demonstrate the same behaviour in lm():

## generate some toy data
foo <- data.frame(x = seq(0, 1, length = 20), y = seq(0, 1, length = 20) + rnorm(20, 0, 0.15))

## a wrapper function
bar <- function(myformula, mydata, keep.mf = TRUE) {
  fit <- lm(myformula, mydata, model = keep.mf)
  predict.lm(fit)
  }

Now this will succeed, by keeping model frame:

bar(y ~ x - 1, foo, keep.mf = TRUE)

while this will fail, by discarding model frame:

bar(y ~ x - 1, foo, keep.mf = FALSE)

Using argument newdata?

Note that my example for lm() is slightly artificial, because we can actually use newdata argument in predict.lm() to get through this problem:

bar1 <- function(myformula, mydata, keep.mf = TRUE) {
  fit <- lm(myformula, mydata, model = keep.mf)
  predict.lm(fit, newdata = lapply(mydata, mean))
  }

Now whether we keep model frame, both will succeed:

bar1(y ~ x - 1, foo, keep.mf = TRUE)
bar1(y ~ x - 1, foo, keep.mf = FALSE)

Then you may wonder: can we do the same for survfit()?

survfit() is a generic function, in your code, you are really calling survfit.coxph(). There is indeed a newdata argument for this function. The documentation reads:

newdata:

a data frame with the same variable names as those that appear in the ‘coxph’ formula. ... ... Default is the mean of the covariates used in the ‘coxph’ fit.

So, let's try:

my_function1 <- function(myformula, mydata) {
  mtrace.off()
  fit <- coxph(myformula, mydata, method = "breslow")
  survival:::survfit.coxph(fit, newdata = lapply(mydata, mean))
  }

and we hope this work:

my_function1(Surv(start, stop, event) ~ rx + number, bladder2)

But:

Error in is.data.frame(data) (from #5) : object 'mydata' not found

1: my_function1(Surv(start, stop, event) ~ rx + number, bladder2)
2: #5: survival:::survfit.coxph(fit, lapply(mydata, mean))
3: stats::model.frame(object)
4: model.frame.coxph(object)
5: eval(temp, environment(formula$terms), parent.frame())
6: eval(expr, envir, enclos)
7: stats::model.frame(formula = Surv(start, stop, event) ~ rx + number, data =
8: model.frame.default(formula = Surv(start, stop, event) ~ rx + number, data 
9: is.data.frame(data)

Note that although we pass in newdata, it is not used in construction of model frame:

3: stats::model.frame(object)

Only object, a copy of fitted model, is passed to model.frame.default().

This is very different from what happens in predict.lm(), predict.glm() and mgcv:::predict.gam(). In these routines, newdata is passed to model.frame.default(). For example, in lm(), there is:

m <- model.frame(Terms, newdata, na.action = na.action, xlev = object$xlevels)

I don't use survival package, so not sure how newdata works in this package. So I think we really need some expert explaining this.

like image 79
Zheyuan Li Avatar answered Sep 28 '22 16:09

Zheyuan Li