Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Programmatically building formulas without string manipulation

Tags:

string

r

For the sake of example, consider a basic regression model in R:

form1 <- Petal.Length ~ Sepal.Length + Sepal.Width
fit1 <- lm(form1, iris)

(My apologies to any botanists who post here.)

In order to add quadratic and interaction terms, I know of three approaches:

1) the old-fashioned way

Entering terms one at a time:

form2 <- . ~ Sepal.Length*Sepal.Width + I(Sepal.Length^2) + I(Sepal.Width^2)
fit2 <- update(fit1, form2)

This doesn't scale beyond small formulas and you can't program with it.

2) the ugly way

String manipulation:

vars <- attr(terms(form1), "term.labels")
squared_terms <- sprintf("I(%s^2)", vars)
inter_terms <- combn(vars, 2, paste, collapse = "*")
form2 <- reformulate(c(inter_terms, squared_terms), ".")

This scales, but it's not really programmable because the functions themselves need to be hard-coded.

3) the "back entrance"

Manipulate the data directly

library(lazyeval)
library(dplyr)

square <- function (v) interp(~ I(v1^2), v1 = as.name(v))
inter <- function(v) interp(~ v1*v2, v1 = as.name(v[1]), v2 = as.name(v[2]))

vars <- attr(terms(form1), "term.labels")
squared_terms <- lapply(vars, square) %>%
  set_names(paste0(vars, " ^2"))
inter_terms <- combn(vars, 2, inter, simplify = FALSE) %>%
  set_names(combn(vars, 2, paste, collapse = " x "))

fit2 <- model.frame(fit1) %>%
  mutate_(.dots = squared_terms) %>%
  mutate_(.dots = inter_terms) %>%
  lm(Petal.Length ~ ., data = .)

This is fairly scalable, and programmable up to variable naming. But it's also kind of crazy because it defeats the purpose of using a formula in the first place.

what I wish I could do

I wish I could do something like this:

library(lazyeval)
library(dplyr)

square <- function (v) interp(~ I(v1^2), v1 = as.name(v))
inter <- function(v) interp(~ v1*v2, v1 = as.name(v[1]), v2 = as.name(v[2]))

squared_terms <- apply.formula(form1, squared_terms)
inter_terms <- combn.formula(form1, 2, inter)

fit2 <- form1 %>%
  append.formula(squared_terms) %>%
  append.formula(inter_terms) %>%
  update(fit1, .)

Abuse of dplyr aside, there are two killer features here:

  1. the ability to programmatically generate arbitrary formula terms out of basic R objects
  2. the ability to add terms to a formula that will behave like terms that were typed in by hand

Feature 1 is obtainable with Method 3, and Feature 2 is obtainable with Method 2. Is there a Method 4 -- the middle way -- that obtains both at the same time?

like image 305
shadowtalker Avatar asked Dec 25 '22 20:12

shadowtalker


1 Answers

OK, there are a lot of moving pieces here, but here are some helper functions that to very specific things

extract_rhs_symbols <- function(x) {
    as.list(attr(delete.response(terms(x)), "variables"))[-1]
}
symbols_to_formula <- function(x) {
    as.call(list(quote(`~`), x))    
}
sum_symbols <- function(...) {
    Reduce(function(a,b) bquote(.(a)+.(b)), do.call(`c`, list(...), quote=T))
}
square_terms <- function(x) {
    symbols_to_formula(sum_symbols(sapply(extract_rhs_symbols(x), function(x) bquote(I(.(x)^2)))))
}
interact_rhs<-function(x) {
    x[[length(as.list(x))]] <- bquote((.(x[[length(as.list(x))]]))^2)
    x
}
add_rhs_dot <- function(x) {
   x[[length(as.list(x))]] <- sum_symbols(quote(.), x[[length(as.list(x))]])    
   x
}
add_terms<-function(f, x) {
    update(f, add_rhs_dot(x))
}

all of these basically manipulate formulas as calls.

So if you have a formula like

my.formula <- Petal.Length ~ Sepal.Length + Sepal.Width + Other

You can add squared terms with

add_terms(my.formula, square_terms(my.formula))

you can get all the right-hand interactions with

interact_rhs(my.formula)

or do both with

add_terms(interact_rhs(my.formula), square_terms(my.formula))

which gives

Petal.Length ~ Sepal.Length + Sepal.Width + Other + I(Sepal.Length^2) + 
    I(Sepal.Width^2) + I(Other^2) + Sepal.Length:Sepal.Width + 
    Sepal.Length:Other + Sepal.Width:Other

I haven't throughly tested this so there are likely to be cases where this breaks down, but it should work in most simple cases.

like image 198
MrFlick Avatar answered Jan 31 '23 08:01

MrFlick