In the last few months I've worked on a number of projects where I've used the `glmnet` package to fit elastic net models. It's great, but the interface is rather bare-bones compared to most R modelling functions. In particular, rather than specifying a formula and data frame, you have to give a response vector and predictor matrix. You also lose out on many quality-of-life features that the regular interface provides, e.g. sensible (?) treatment of factors, handling of missing values, putting variables into the correct order, and so on.
So I've generally ended up writing my own code to recreate the formula/data frame interface. Due to client confidentiality issues, I've also ended up leaving this code behind and having to write it again for the next project. I figured I might as well bite the bullet and create an actual package to do this. However, a couple of questions arose before I did so: has anyone already built such an interface, and if not, why not?
The formula interface in R allows you to make transformations of the input data frame automatically. For example, categorical (or factor) columns will generate the appropriate dummy variables.
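For example, with a small toy data frame (my own illustration), the standard machinery expands a factor into dummy columns:

```r
# a 3-level factor expands into dummy (indicator) columns via model.matrix
df <- data.frame(y = rnorm(6), g = factor(rep(c("a", "b", "c"), 2)))
model.matrix(y ~ g, data = df)
# columns: (Intercept), gb, gc -- level "a" is absorbed into the intercept
```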
Glmnet is a package that fits generalized linear and similar models via penalized maximum likelihood. The regularization path is computed for the lasso or elastic net penalty at a grid of values (on the log scale) for the regularization parameter lambda.
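For comparison, here's what the bare-bones interface looks like (a minimal example of my own, using the built-in mtcars dataset):

```r
library(glmnet)

# glmnet wants a numeric matrix x and a response vector y;
# factors, missing values, etc must be dealt with beforehand
x <- as.matrix(mtcars[, c("cyl", "disp", "hp", "wt")])
y <- mtcars$mpg
fit <- glmnet(x, y, alpha = 0.5)   # elastic net penalty
coef(fit, s = 0.1)                 # coefficients at lambda = 0.1
```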
Well, it looks like there's no pre-built formula interface, so I went ahead and made my own. You can download it from Github: https://github.com/Hong-Revo/glmnetUtils
Or in R, using `devtools::install_github`:

```r
install.packages("devtools")
library(devtools)
install_github("hong-revo/glmnetUtils")
library(glmnetUtils)
```
From the readme:
Some quality-of-life functions to streamline the process of fitting elastic net models with `glmnet`, specifically:

- `glmnet.formula` provides a formula/data frame interface to `glmnet`.
- `cv.glmnet.formula` does a similar thing for `cv.glmnet`.
- Methods for `predict` and `coef` for both the above.
- A function `cvAlpha.glmnet` to choose both the alpha and lambda parameters via cross-validation, following the approach described in the help page for `cv.glmnet`. Optionally does the cross-validation in parallel.
- Methods for `plot`, `predict` and `coef` for the above.
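Based on the readme, basic usage looks something like the following (a sketch on my part, assuming the formula methods are dispatched from the usual `glmnet` and `cv.glmnet` generics, and using mtcars as a stand-in dataset):

```r
library(glmnetUtils)

# formula/data frame interface: the package builds x and y internally
fit <- glmnet(mpg ~ cyl + disp + hp + wt, data = mtcars, alpha = 0.5)

# cross-validated fit, with the usual predict and coef methods
cvfit <- cv.glmnet(mpg ~ ., data = mtcars)
predict(cvfit, newdata = mtcars)
coef(cvfit)

# cross-validate over both alpha and lambda, optionally in parallel
cva <- cvAlpha.glmnet(mpg ~ ., data = mtcars)
plot(cva)
```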
Incidentally, while writing the above, I think I realised why nobody has done this before. Central to R's handling of model frames and model matrices is a `terms` object, which includes a matrix with one row per variable and one column per main effect and interaction. In effect, that's (at minimum) roughly a p x p matrix, where p is the number of variables in the model. When p is 16000, which is common these days with wide data, the resulting matrix is about a gigabyte in size.
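You can get a feel for the size of this object with a quick experiment (an illustrative sketch of my own; exact numbers will vary):

```r
# the "factors" attribute of a terms object is a variables-by-terms matrix,
# so it grows quadratically in the number of main effects
p <- 2000
vars <- paste0("x", seq_len(p))
f <- formula(paste("y ~", paste(vars, collapse = " + ")))
tt <- terms(f)
dim(attr(tt, "factors"))             # roughly p x p
print(object.size(tt), units = "MB")
```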
Still, I haven't had any problems (yet) working with these objects. If it becomes a major issue, I'll see if I can find a workaround.
I've pushed an update to the repo, to address the above issue as well as one related to factors. From the documentation:
There are two ways in which glmnetUtils can generate a model matrix out of a formula and data frame. The first is to use the standard R machinery comprising `model.frame` and `model.matrix`; the second is to build the matrix one variable at a time. These options are discussed and contrasted below.

Using model.frame
This is the simpler option, and the one that is most compatible with other R modelling functions. The `model.frame` function takes a formula and data frame and returns a model frame: a data frame with special information attached that lets R make sense of the terms in the formula. For example, if a formula includes an interaction term, the model frame will specify which columns in the data relate to the interaction, and how they should be treated. Similarly, if the formula includes expressions like `exp(x)` or `I(x^2)` on the RHS, `model.frame` will evaluate these expressions and include them in the output.
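For example (my own illustration, again using mtcars):

```r
# model.frame evaluates exp(wt) and I(hp^2) and stores the results, along
# with a terms attribute describing the structure of the formula
mf <- model.frame(mpg ~ exp(wt) + I(hp^2) + factor(cyl), data = mtcars)
head(model.matrix(attr(mf, "terms"), mf))
```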
The major disadvantage of using `model.frame` is that it generates a `terms` object, which encodes how variables and interactions are organised. One of the attributes of this object is a matrix with one row per variable, and one column per main effect and interaction. At minimum, this is (approximately) a p x p square matrix, where p is the number of main effects in the model. For wide datasets with p > 10000, this matrix can approach or exceed a gigabyte in size. Even if there is enough memory to store such an object, generating the model matrix can take a significant amount of time.
Another issue with the standard R approach is the treatment of factors. Normally, `model.matrix` will turn an N-level factor into an indicator matrix with N-1 columns, with one column being dropped. This is necessary for unregularised models as fit with `lm` and `glm`, since the full set of N columns is linearly dependent. With the usual treatment contrasts, the interpretation is that the dropped column represents a baseline level, while the coefficients for the other columns represent the difference in the response relative to the baseline.

This may not be appropriate for a regularised model as fit with glmnet. The regularisation procedure shrinks the coefficients towards zero, which forces the estimated differences from the baseline to be smaller. But this only makes sense if the baseline level was chosen beforehand, or is otherwise meaningful as a default; otherwise it is effectively making the levels more similar to an arbitrarily chosen level.
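The difference is easy to see with a single factor (my own illustration):

```r
f <- factor(c("a", "b", "c", "a"))
model.matrix(~ f)      # treatment contrasts: intercept + fb, fc ("a" is the baseline)
model.matrix(~ f - 1)  # full indicator coding: one column per level, no baseline
```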
Manually building the model matrix
To deal with the problems above, glmnetUtils by default will avoid using `model.frame`, instead building up the model matrix term by term. This avoids the memory cost of creating a `terms` object, and can be noticeably faster than the standard approach. It will also include one column in the model matrix for all levels in a factor; that is, no baseline level is assumed. In this situation, the coefficients represent differences from the overall mean response, and shrinking them to zero is meaningful (usually).
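A stripped-down sketch of the idea (this is not the package's actual code, and `buildX` is a name I made up for illustration):

```r
# build a model matrix one variable at a time: numeric columns pass
# through unchanged, factors get one indicator column per level
buildX <- function(data)
{
    cols <- lapply(names(data), function(nm) {
        x <- data[[nm]]
        if (is.factor(x)) {
            m <- sapply(levels(x), function(l) as.numeric(x == l))
            colnames(m) <- paste0(nm, levels(x))
            m
        }
        else matrix(x, dimnames = list(NULL, nm))
    })
    do.call(cbind, cols)
}

head(buildX(data.frame(x = 1:4, g = factor(c("a", "b", "a", "b")))))
```

Because each variable is handled independently, no terms object (and hence no p x p matrix) is ever created.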
The main downside of not using `model.frame` is that the formula can only be relatively simple. At the moment, only straightforward formulas like `y ~ x1 + x2 + ... + x_p` are handled by the code, where the x's are columns already present in the data. Interaction terms and computed expressions are not supported. Where possible, you should compute such expressions beforehand, as in the sketch below.
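For instance (an illustrative sketch using mtcars, assuming the glmnetUtils formula method):

```r
# compute the squared term up front, then use a simple formula
mtcars$disp2 <- mtcars$disp^2
fit <- glmnet(mpg ~ disp + disp2 + wt, data = mtcars)
```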
After a few hiccups, this is finally on CRAN.
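The released version can now be installed in the usual way:

```r
install.packages("glmnetUtils")
```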