For the sake of example, consider a basic regression model in R:
form1 <- Petal.Length ~ Sepal.Length + Sepal.Width
fit1 <- lm(form1, iris)
(My apologies to any botanists who post here.)
In order to add quadratic and interaction terms, I know of three approaches:
Entering terms one at a time:
form2 <- . ~ Sepal.Length*Sepal.Width + I(Sepal.Length^2) + I(Sepal.Width^2)
fit2 <- update(fit1, form2)
This doesn't scale beyond small formulas and you can't program with it.
String manipulation:
vars <- attr(terms(form1), "term.labels")
squared_terms <- sprintf("I(%s^2)", vars)
inter_terms <- combn(vars, 2, paste, collapse = "*")
form2 <- reformulate(c(inter_terms, squared_terms), ".")
This scales, but it's not really programmable because the functions themselves need to be hard-coded.
Manipulate the data directly
library(lazyeval)
library(dplyr)
square <- function (v) interp(~ I(v1^2), v1 = as.name(v))
inter <- function(v) interp(~ v1*v2, v1 = as.name(v[1]), v2 = as.name(v[2]))
vars <- attr(terms(form1), "term.labels")
squared_terms <- lapply(vars, square) %>%
set_names(paste0(vars, " ^2"))
inter_terms <- combn(vars, 2, inter, simplify = FALSE) %>%
set_names(combn(vars, 2, paste, collapse = " x "))
fit2 <- model.frame(fit1) %>%
mutate_(.dots = squared_terms) %>%
mutate_(.dots = inter_terms) %>%
lm(Petal.Length ~ ., data = .)
This is fairly scalable, and programmable up to variable naming. But it's also kind of crazy because it defeats the purpose of using a formula
in the first place.
I wish I could do something like this:
library(lazyeval)
library(dplyr)
square <- function (v) interp(~ I(v1^2), v1 = as.name(v))
inter <- function(v) interp(~ v1*v2, v1 = as.name(v[1]), v2 = as.name(v[2]))
squared_terms <- apply.formula(form1, squared_terms)
inter_terms <- combn.formula(form1, 2, inter)
fit2 <- form1 %>%
append.formula(squared_terms) %>%
append.formula(inter_terms) %>%
update(fit1, .)
Abuse of dplyr
aside, there are two killer features here:
Feature 1 is obtainable with Method 3, and Feature 2 is obtainable with Method 2. Is there a Method 4 -- the middle way -- that obtains both at the same time?
OK, there are a lot of moving pieces here, but here are some helper functions that to very specific things
extract_rhs_symbols <- function(x) {
as.list(attr(delete.response(terms(x)), "variables"))[-1]
}
symbols_to_formula <- function(x) {
as.call(list(quote(`~`), x))
}
sum_symbols <- function(...) {
Reduce(function(a,b) bquote(.(a)+.(b)), do.call(`c`, list(...), quote=T))
}
square_terms <- function(x) {
symbols_to_formula(sum_symbols(sapply(extract_rhs_symbols(x), function(x) bquote(I(.(x)^2)))))
}
interact_rhs<-function(x) {
x[[length(as.list(x))]] <- bquote((.(x[[length(as.list(x))]]))^2)
x
}
add_rhs_dot <- function(x) {
x[[length(as.list(x))]] <- sum_symbols(quote(.), x[[length(as.list(x))]])
x
}
add_terms<-function(f, x) {
update(f, add_rhs_dot(x))
}
all of these basically manipulate formulas as calls.
So if you have a formula like
my.formula <- Petal.Length ~ Sepal.Length + Sepal.Width + Other
You can add squared terms with
add_terms(my.formula, square_terms(my.formula))
you can get all the right-hand interactions with
interact_rhs(my.formula)
or do both with
add_terms(interact_rhs(my.formula), square_terms(my.formula))
which gives
Petal.Length ~ Sepal.Length + Sepal.Width + Other + I(Sepal.Length^2) +
I(Sepal.Width^2) + I(Other^2) + Sepal.Length:Sepal.Width +
Sepal.Length:Other + Sepal.Width:Other
I haven't throughly tested this so there are likely to be cases where this breaks down, but it should work in most simple cases.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With