
Automatically create formulas for all possible linear models

Tags:

r

Say I have a training set in a data frame train with columns ColA, ColB, ColC, etc. One of these columns designates a binary class, say column Class, with "yes" or "no" values.

I'm trying out some binary classifiers, e.g.:

library(klaR)
mynb <- NaiveBayes(Class ~ ColA + ColB + ColC, train)

I would like to run the above code in a loop, automatically generating all possible combinations of columns in the formula, i.e.:

mynb <- append(mynb, NaiveBayes(Class ~ ColA, train))
mynb <- append(mynb, NaiveBayes(Class ~ ColA + ColB, train))
mynb <- append(mynb, NaiveBayes(Class ~ ColA + ColB + ColC, train))
...
mynb <- append(mynb, NaiveBayes(Class ~ ColB + ColC + ColD, train))
...

How can I automatically generate formulas for each possible linear model involving columns of a data frame?

Leo asked Mar 14 '11 15:03




2 Answers

Say we work with this ridiculous example:

DF <- data.frame(Class=1:10,A=1:10,B=1:10,C=1:10)

Then you get the names of the predictor columns (everything except Class):

Cols <- names(DF)
Cols <- Cols[! Cols %in% "Class"]
n <- length(Cols)

You construct all possible combinations of column indices:

id <- unlist(
        lapply(1:n,
              function(i)combn(1:n,i,simplify=FALSE)
        )
      ,recursive=FALSE)

You paste them into formula strings:

Formulas <- sapply(id,function(i)
              paste("Class~",paste(Cols[i],collapse="+"))
            )
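As an alternative to pasting strings, base R's reformulate() can build formula objects directly from the column names. The following is a minimal sketch on the same toy data; the name `subsets` is introduced here for illustration only.

```r
# Toy data matching the example above
DF <- data.frame(Class = 1:10, A = 1:10, B = 1:10, C = 1:10)
Cols <- setdiff(names(DF), "Class")

# All non-empty subsets of the predictor names, as a flat list of
# character vectors (size 1 first, then size 2, and so on)
subsets <- unlist(
  lapply(seq_along(Cols), function(k) combn(Cols, k, simplify = FALSE)),
  recursive = FALSE
)

# reformulate() turns each vector of term labels into a formula object
Formulas <- lapply(subsets, reformulate, response = "Class")
```

Because these are already formula objects, they can be passed to a model function without as.formula().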

And you loop over them to fit the models:

lapply(Formulas,function(i)
    lm(as.formula(i),data=DF))
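To keep track of which fit belongs to which formula, the results can be stored in a named list. The sketch below wraps the loop in a hypothetical helper, `fit_all`, whose `fitter` argument defaults to lm; any model function that accepts a formula and a `data` argument should slot in (for the question, presumably klaR's NaiveBayes on the train data frame, which is an assumption not tested here).

```r
# Toy data and formula strings, built as in the steps above
DF <- data.frame(Class = 1:10, A = 1:10, B = 1:10, C = 1:10)
Cols <- setdiff(names(DF), "Class")
subsets <- unlist(
  lapply(seq_along(Cols), function(k) combn(Cols, k, simplify = FALSE)),
  recursive = FALSE
)
Formulas <- vapply(
  subsets,
  function(v) paste("Class ~", paste(v, collapse = " + ")),
  character(1)
)

# Hypothetical helper: fit one model per formula string and
# name each list element after the formula that produced it
fit_all <- function(formulas, data, fitter = lm) {
  fits <- lapply(formulas, function(f) fitter(as.formula(f), data = data))
  names(fits) <- formulas
  fits
}

models <- fit_all(Formulas, DF)
```

A single fit can then be looked up by its formula, e.g. models[["Class ~ A + B"]].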

Be warned though: if you have more than a handful of columns, this quickly becomes very heavy on memory and results in thousands of models. With n predictor columns you get 2^n - 1 different models, so 10 columns already give 1023.
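The growth is easy to check before fitting anything. This short sketch counts the non-empty subsets for a few values of n; `n_models` is a name made up for this illustration.

```r
# Number of non-empty predictor subsets for n columns:
# sum over k = 1..n of choose(n, k), which equals 2^n - 1
n_models <- function(n) sum(choose(n, seq_len(n)))

n <- c(3, 5, 10, 20)
counts <- data.frame(n = n, models = sapply(n, n_models))
print(counts)
```

For the three-column toy example this gives 7 models; at 20 columns it is already more than a million.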

Make very sure that this is what you want. In general, this kind of model comparison is strongly advised against, and you can forget about any kind of inference when you do this.

Joris Meys answered Oct 04 '22 12:10



Here is an excellent blog post by Mark Heckman, detailing how to construct all possible regression models, given a set of explanatory variables and a response variable. However, as pointed out by Joris, I would strongly caution against using such an approach, since (a) the number of regressions increases exponentially with the number of variables and (b) statistical experts don't recommend data fishing of this kind, as it is fraught with all kinds of risks.

Ramnath answered Oct 04 '22 12:10
