Say I have a training set in a data frame train with columns ColA, ColB, ColC, etc. One of these columns designates a binary class, say column Class, with "yes" or "no" values.
I'm trying out some binary classifiers, e.g.:
library(klaR)
mynb <- NaiveBayes(Class ~ ColA + ColB + ColC, train)
I would like to run the above code in a loop, automatically generating all possible combinations of columns in the formula, i.e.:
mynb <- append(mynb, NaiveBayes(Class ~ ColA, train))
mynb <- append(mynb, NaiveBayes(Class ~ ColA + ColB, train))
mynb <- append(mynb, NaiveBayes(Class ~ ColA + ColB + ColC, train))
...
mynb <- append(mynb, NaiveBayes(Class ~ ColB + ColC + ColD, train))
...
How can I automatically generate formulas for each possible linear model involving columns of a data frame?
Simple linear regression: models using only one predictor. Multiple linear regression: models using multiple predictors. Multivariate linear regression: models for multiple response variables.
The formula for a linear model is y=mx+b. The y represents the output value, the m represents the rate of change, the x represents the input value, and the b represents the constant.
All-possible-regressions goes beyond stepwise regression and literally tests all possible subsets of the set of potential independent variables. (This is the "Regression Model Selection" procedure in Statgraphics.)
The General Linear Model: y = a set of outcome variables; x = a set of pre-program variables or covariates; b0 = the set of intercepts (the value of each y when each x = 0); b = a set of coefficients, one for each x.
Say we work with this ridiculous example:
DF <- data.frame(Class=1:10,A=1:10,B=1:10,C=1:10)
Then you get the names of the columns
Cols <- names(DF)
Cols <- Cols[! Cols %in% "Class"]
n <- length(Cols)
You construct all possible combinations
id <- unlist(
  lapply(1:n,
    function(i) combn(1:n, i, simplify=FALSE)
  ),
  recursive=FALSE)
You paste them to formulas
Formulas <- sapply(id, function(i)
  paste("Class~", paste(Cols[i], collapse="+"))
)
And you loop over them to apply the models.
lapply(Formulas, function(i)
  lm(as.formula(i), data=DF))
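The steps above can also be wrapped into one function that works on any data frame. This is only a minimal sketch: the helper name all_formulas is hypothetical (it is not from any package), and it uses base R's reformulate() to build formula objects directly instead of pasting strings and calling as.formula().

```r
# Hypothetical helper: build one formula per non-empty subset of predictors.
all_formulas <- function(data, response) {
  # Every column except the response is treated as a candidate predictor.
  predictors <- setdiff(names(data), response)
  # For each subset size k, combn() lists every k-element subset of predictors.
  subsets <- unlist(
    lapply(seq_along(predictors), function(k) {
      combn(predictors, k, simplify = FALSE)
    }),
    recursive = FALSE
  )
  # reformulate() turns a character vector of terms into response ~ t1 + t2 + ...
  lapply(subsets, reformulate, response = response)
}

DF <- data.frame(Class = 1:10, A = 1:10, B = 1:10, C = 1:10)
forms <- all_formulas(DF, "Class")
length(forms)  # 7, i.e. 2^3 - 1

# Fit one model per formula, as in the loop above.
models <- lapply(forms, function(f) lm(f, data = DF))
```

Since the question's NaiveBayes call takes a formula and a data frame just like lm does here, swapping it into the last line should work the same way, though that has not been tested in this sketch.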
Be warned though: if you have more than a handful of columns, this quickly becomes very heavy on memory and produces literally thousands of models. With n predictor columns you get 2^n - 1 different models.
Make very sure that is what you want; in general, this kind of model comparison is strongly advised against. Forget about any kind of inference as well when you do this.
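To put numbers on the 2^n - 1 count, here is a quick sketch (n_models is just an illustrative name) that cross-checks the formula against the number of subsets of each size and prints the count for a few values of n:

```r
# Number of non-empty predictor subsets for n candidate columns.
n_models <- function(n) 2^n - 1

# Cross-check: summing choose(n, k) over subset sizes k = 1..n gives the same count.
n <- 10
stopifnot(n_models(n) == sum(choose(n, 1:n)))

n_models(c(3, 5, 10, 20))
# 7 31 1023 1048575
```

So 10 candidate columns already mean over a thousand fitted models, and 20 mean over a million.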
Here is an excellent blog post by Mark Heckman, detailing how to construct all possible regression models, given a set of explanatory variables and a response variable. However, as pointed out by Joris, I would strictly caution against using such an approach since (a) the number of regressions increases exponentially and (b) statistical experts don't recommend data fishing of this kind, as it is fraught with all kinds of risks.