
Any pitfalls to using programmatically constructed formulas?

Tags: r, scoping

I want to run through a long vector of potential explanatory variables, regressing a response variable on each in turn. Rather than paste together the model formula, I'm thinking of using reformulate(), as demonstrated here.

The function fun() below seems to do the job, fitting the desired model. Notice, though, that it records in its call element the name of the constructed formula object rather than its value.

## (1) Function using programmatically constructed formula
fun <- function(XX) {
    ff <- reformulate(response="mpg", termlabels=XX)
    lm(ff, data=mtcars)
}
fun(XX=c("cyl", "disp"))
# 
# Call:
# lm(formula = ff, data = mtcars)                 <<<--- Note recorded call
# 
# Coefficients:
# (Intercept)          cyl         disp  
#    34.66099     -1.58728     -0.02058  

## (2) Result of directly specified formula (just for purposes of comparison)
lm(mpg ~ cyl + disp, data=mtcars)
# 
# Call:
# lm(formula = mpg ~ cyl + disp, data = mtcars)   <<<--- Note recorded call
# 
# Coefficients:
# (Intercept)          cyl         disp  
#    34.66099     -1.58728     -0.02058  

My question: Is there any danger in this? Can it become a problem if, for instance, I later want to apply update(), predict(), or some other function to the model fit object (possibly from some other environment)?

A slightly more awkward alternative that does, nevertheless, get the recorded call right is to use eval(substitute()). Is this in any way a generally safer construct?

fun2 <- function(XX) {
    ff <- reformulate(response="mpg", termlabels=XX)
    eval(substitute(lm(FF, data=mtcars), list(FF=ff)))
}
fun2(XX=c("cyl", "disp"))$call
## lm(formula = mpg ~ cyl + disp, data = mtcars)
Josh O'Brien asked Jun 30 '13 23:06


1 Answer

I'm always hesitant to claim there are no situations in which something involving R environments and scoping might bite, but ... after some more exploration, my first usage above does look safe.

It turns out that the printed call is a bit of a red herring.

The formula that actually gets used by other functions (and the one extracted by formula() and as.formula()) is the one stored in the terms element of the fit object, and it gets the actual formula right. (The terms element contains an object of class "terms", which is just a "formula" with a bunch of attached attributes.)
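As a minimal sketch of that point (reusing fun() from the question; the name fit is just an illustrative variable), one can compare what the call element records with what the terms element holds:

```r
# fun() as defined in the question
fun <- function(XX) {
    ff <- reformulate(response = "mpg", termlabels = XX)
    lm(ff, data = mtcars)
}
fit <- fun(c("cyl", "disp"))   # `fit` is an illustrative name

# The recorded call only holds the unevaluated symbol `ff`
fit$call$formula

# The terms element carries the formula itself, plus attributes
class(fit$terms)

# formula() recovers the formula from fit$terms rather than from the call
formula(fit)
```

Under this reading, the symbol shown in the printed call matters only to code that re-evaluates the call as recorded.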

To see that all of the proposals in my question and the associated comments store the same "formula" object (up to the associated environment), run the following.

## First the three approaches in my post
formula(fun(XX=c("cyl", "disp")))
# mpg ~ cyl + disp
# <environment: 0x026d2b7c>

formula(lm(mpg ~ cyl + disp, data=mtcars))
# mpg ~ cyl + disp

formula(fun2(XX=c("cyl", "disp"))$call)
# mpg ~ cyl + disp
# <environment: 0x02c4ce2c>

## Then Gabor Grothendieck's idea
XX <- c("cyl", "disp")
ff <- reformulate(response="mpg", termlabels=XX)
formula(do.call("lm", list(ff, quote(mtcars))))  
## mpg ~ cyl + disp

To confirm that formula() really is deriving its output from the terms element of the fit object, have a look at stats:::formula.lm and stats:::formula.terms.
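As a rough check on the update() and predict() concern from the question, the sketch below calls both from the global environment on a fit returned by fun(). The names fit, p, fit2, and res are illustrative, and the last step relies on my reading of how update() re-evaluates the stored call, so treat it as something to verify rather than a guarantee.

```r
# fun() as defined in the question
fun <- function(XX) {
    ff <- reformulate(response = "mpg", termlabels = XX)
    lm(ff, data = mtcars)
}
fit <- fun(c("cyl", "disp"))

# predict() builds its model frame from the stored terms, not from the call
p <- predict(fit, newdata = data.frame(cyl = 4, disp = 100))

# update() with a new formula replaces call$formula before re-evaluating
fit2 <- update(fit, . ~ . + hp)
formula(fit2)

# update() without a formula re-evaluates the call as recorded, so the
# symbol `ff` must be visible from the calling environment; if it is not
# defined there, this call errors. tryCatch() captures the condition.
res <- tryCatch(update(fit, data = mtcars[1:20, ]), error = function(e) e)
inherits(res, "error")
```

If that last case matters for your workflow, the eval(substitute()) variant from the question avoids it, because the call it records contains the formula itself rather than a symbol.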

Josh O'Brien answered Sep 27 '22 21:09