short formula call for many variables when building a model [duplicate]

Tags:

r

r-formula

I am trying to build a regression model with lm(...). My dataset has a lot of features (more than 50). I do not want to write my code as:

lm(output ~ feature1 + feature2 + feature3 + ... + feature70) 

Is there a shorthand notation for writing this formula?

iinception asked Apr 25 '11 03:04




2 Answers

You can use . as described in the help page for formula. The . stands for "all columns not otherwise in the formula".

lm(output ~ ., data = myData)
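As a minimal sketch of the dot shorthand, the example below uses the built-in mtcars data set, with mpg standing in for output (both are illustrative choices, not part of the original question). It also shows that individual columns can be subtracted from the dot.

```r
# Illustrative only: mtcars ships with R; mpg plays the role of `output`.
fit_all <- lm(mpg ~ ., data = mtcars)

# Drop specific columns by subtracting them from the dot.
fit_some <- lm(mpg ~ . - cyl - gear, data = mtcars)

# Predictor names actually used by each model.
terms_all <- attr(terms(fit_all), "term.labels")
terms_some <- attr(terms(fit_some), "term.labels")
print(terms_all)
print(terms_some)
```

Printing the term labels is a quick way to check which columns the dot actually expanded to.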

Alternatively, construct the formula manually with paste. This example is from the as.formula() help page:

xnam <- paste("x", 1:25, sep="")
(fmla <- as.formula(paste("y ~ ", paste(xnam, collapse= "+"))))

You can then pass this formula object to the regression function: lm(fmla, data = myData).
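If the predictor names are not a neat numbered sequence, the same idea can be sketched by taking the names from the data frame itself. The example below uses reformulate() from the stats package in place of paste() plus as.formula(); the small simulated myData and its column names are hypothetical stand-ins for the real data.

```r
# Hypothetical stand-in for myData: one response plus five predictors.
set.seed(1)
myData <- data.frame(output = rnorm(20), matrix(rnorm(100), ncol = 5))
names(myData)[-1] <- paste0("feature", 1:5)

# Every column except the response becomes a predictor.
predictors <- setdiff(names(myData), "output")

# reformulate() assembles output ~ feature1 + ... + feature5.
fmla <- reformulate(predictors, response = "output")
fit <- lm(fmla, data = myData)
print(fmla)
```

Building the predictor vector first makes it easy to filter it (for example with setdiff() or grep()) before the formula is created.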

Chase answered Sep 19 '22 23:09


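You could also try something like:

lm(output ~ as.matrix(myData[, 2:71]), data=myData)

This assumes output is the first column and feature1 to feature70 are the next 70 columns, all numeric. The as.matrix() call is needed because lm() rejects a data frame used directly as a formula term.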

Or

features <- paste("feature", 1:70, sep="")
lm(output ~ as.matrix(myData[, features]), data=myData)

This is probably smarter, since it doesn't matter where the columns sit in your data.

It might cause issues if rows are removed because of NAs, though.
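A small sketch comparing the matrix-style call with the dot formula, on simulated data (the data frame and its three feature columns are hypothetical). It wraps the selected columns in as.matrix(), assuming they are all numeric, since lm() needs a matrix rather than a data frame as a formula term.

```r
# Hypothetical data: one response and three numeric predictors.
set.seed(42)
myData <- data.frame(output = rnorm(30), matrix(rnorm(90), ncol = 3))
names(myData)[-1] <- paste0("feature", 1:3)
features <- paste0("feature", 1:3)

# Matrix term: all selected columns enter the model as one block.
fit_mat <- lm(output ~ as.matrix(myData[, features]), data = myData)

# Dot formula: each column enters as its own term.
fit_dot <- lm(output ~ ., data = myData)

# The estimates agree; only the coefficient names differ.
print(names(coef(fit_mat)))
same_fit <- isTRUE(all.equal(unname(coef(fit_mat)), unname(coef(fit_dot))))
print(same_fit)
```

The matrix version prefixes every coefficient name with the full as.matrix(...) expression, which is one practical reason to prefer the dot or a constructed formula.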

nzcoops answered Sep 18 '22 23:09