I am trying to build a regression model with lm(...). My dataset has lots of features (>50). I do not want to write my code as:
lm(output ~ feature1 + feature2 + feature3 + ... + feature70)
Is there a shorthand notation for writing this?
Overfitting occurs when too many variables are included in the model, so that the model appears to fit the current data well. Because some of the variables retained in the model are actually noise variables, the model may fail to validate on future datasets.
You can use . as described in the help page for formula. The . stands for "all columns not otherwise in the formula":
lm(output ~ ., data = myData)
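As a small runnable illustration of the dot notation, here is a sketch using the built-in mtcars dataset instead of myData (which is not available here). It also shows that individual columns can be dropped from the expansion with a minus sign; the choice of mpg as the response is just for the example.

```r
# mtcars ships with base R: mpg is used as the response here,
# and the remaining 10 columns become the predictors.
fit_all <- lm(mpg ~ ., data = mtcars)

# "." expands to every other column; "-" removes terms from that expansion.
fit_some <- lm(mpg ~ . - cyl - gear, data = mtcars)

# Inspect which terms ended up in each model.
names(coef(fit_all))
names(coef(fit_some))
```

Note that . picks up every remaining column, so ID columns or other non-predictors need to be excluded explicitly or removed from the data first.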
Alternatively, construct the formula manually with paste. This example is from the as.formula() help page:
xnam <- paste("x", 1:25, sep="")
(fmla <- as.formula(paste("y ~ ", paste(xnam, collapse= "+"))))
You can then pass this object to the regression function: lm(fmla, data = myData)
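A related option, not mentioned above, is reformulate() from the stats package, which builds the same kind of formula from a character vector without the nested paste calls. The sketch below uses made-up data; the names myData, output and feature1 to feature5 are placeholders for illustration only.

```r
# Hypothetical data: five numeric features plus a response called "output".
set.seed(1)
myData <- as.data.frame(matrix(rnorm(100 * 5), ncol = 5))
names(myData) <- paste0("feature", 1:5)
myData$output <- rnorm(100)

# Build the formula output ~ feature1 + ... + feature5 from the column names.
features <- paste0("feature", 1:5)
fmla <- reformulate(features, response = "output")

# Use the formula object exactly like a hand-written one.
fit <- lm(fmla, data = myData)
print(fmla)
```

Because the formula is built from names, it can be combined with setdiff(names(myData), "output") when the feature columns do not follow a regular naming pattern.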
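You could also try things like:
lm(output ~ as.matrix(myData[, 2:71]), data = myData)
This assumes output is the first column and feature1 to feature70 are the next 70 columns. The as.matrix() call is needed because lm() does not accept a data frame as a single term in the formula; it also requires the selected columns to be numeric.
Or
features <- paste("feature", 1:70, sep = "")
lm(output ~ as.matrix(myData[, features]), data = myData)
which is probably smarter, as it doesn't matter where among your data the columns are.
This might cause issues if rows are removed because of NAs, though...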
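One way to keep the select-by-name idea while leaving NA handling to lm() is to subset the data frame first and then use the dot notation. This is only a sketch on invented data: the id column and the three feature columns are hypothetical stand-ins.

```r
# Hypothetical data with an extra "id" column that should not be a predictor.
set.seed(42)
myData <- data.frame(
  id       = 1:50,
  output   = rnorm(50),
  feature1 = rnorm(50),
  feature2 = rnorm(50),
  feature3 = rnorm(50)
)
myData$feature2[c(3, 10)] <- NA  # introduce two missing values

# Select the response and the wanted features by name, then let "." expand.
features <- paste0("feature", 1:3)
fit <- lm(output ~ ., data = myData[, c("output", features)])

# lm() drops incomplete rows itself (default na.action is na.omit).
nobs(fit)
names(coef(fit))
```

The coefficients are then labelled with the plain column names, which tends to be easier to read than the labels produced by putting a matrix inside the formula.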