Some modeling functions, e.g. glmnet(), require (or at least allow) the data to be passed in as a predictor matrix and a response matrix (or vector), as opposed to a formula. In these cases, the predict() method, e.g. predict.glmnet(), typically requires that its new-data argument (newx in the case of predict.glmnet()) provide a predictor matrix with the same features that were used to train the model.
A convenient way to create a predictor matrix when your dataframe has factors (R's categorical data type) is to use the model.matrix() function, which automatically creates dummy features for your categorical variables:
# this is the dataframe and matrix I want to use to train the model
set.seed(1)
df <- data.frame(x1 = factor(sample(LETTERS[1:5], 20, replace = TRUE)),
                 x2 = rnorm(20, 100, 5),
                 x3 = factor(sample(c("U", "L"), 20, replace = TRUE)),
                 y = rnorm(20, 10, 2))
mm <- model.matrix(y ~ ., data = df)
But when I introduce a dataframe with new observations that contain only a subset of the levels of the factors from the original dataframe, model.matrix() (predictably) returns a matrix with different dummy features. This new matrix cannot be used in predict.glmnet() because it doesn't have the same features that the model is expecting:
# this is the dataframe and matrix I want to predict on
set.seed(1)
df_new <- data.frame(x1 = c("B", "C"),
                     x2 = rnorm(2, 100, 5),
                     x3 = c("L", "U"))
mm_new <- model.matrix(~ ., data = df_new)
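To make the mismatch concrete, here is a minimal sketch that compares the column names of the two matrices. It uses small deterministic stand-ins (train_toy and new_toy are hypothetical names, not the dataframes above) so the result does not depend on the random draws:

```r
# Toy stand-ins for the training data and the new observations
# (hypothetical names and values, chosen to be deterministic)
train_toy <- data.frame(
  x1 = factor(rep(LETTERS[1:5], times = 4)),
  x2 = seq(90, 109, length.out = 20),
  x3 = factor(rep(c("U", "L"), times = 10)),
  y  = seq(8, 12, length.out = 20)
)
new_toy <- data.frame(
  x1 = c("B", "C"),
  x2 = c(98, 101),
  x3 = c("L", "U")
)

mm_train <- model.matrix(y ~ ., data = train_toy)
mm_naive <- model.matrix(~ ., data = new_toy)

# Dummy columns the model was trained on but the naive matrix lacks
missing_cols <- setdiff(colnames(mm_train), colnames(mm_naive))
print(missing_cols)
```

Here the naive matrix only gets a dummy column for the levels it happens to see, so the columns for the other levels of x1 are absent.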
Is there a way to save the transformation (creating all necessary dummy features) from a dataframe to a model matrix so that I can re-apply this transformation to future observations? In my above example, this would ideally result in mm_new having identical feature names as mm so that predict() can accept mm_new.
I want to add that I'm aware of this approach, which essentially suggests including the observations from df_new in df before calling model.matrix(). This works fine if I have all the observations to begin with and I'm just training and testing models. However, the new observations will only be accessible in the future (in a production prediction pipeline), and I want to avoid the overhead of re-loading the entire training dataframe for new predictions.
I found exactly what I needed in the documentation for model.matrix and model.frame, and wanted to share. There is an argument in model.matrix called xlev which is "to be used as argument of model.frame if data is such that model.frame is called."
If model.matrix calls model.frame, xlev expects a named list of character vectors, one per factor in the dataframe (with the list element name being the factor name). Each character vector contains the full set of factor levels needed to build the new model.matrix with the same dummy features as the original model.matrix.
Here's a working example:
set.seed(1)
df <- data.frame(x1 = factor(sample(LETTERS[1:5], 20, replace = TRUE)),
                 x2 = rnorm(20, 100, 5),
                 x3 = factor(sample(c("U", "L"), 20, replace = TRUE)),
                 y = rnorm(20, 10, 2))
mm <- model.matrix(y ~ ., data = df)
# this is a named list of levels for each factor in the original df
xlevs <- lapply(df[, sapply(df, is.factor), drop = FALSE], function(j) {
  levels(j)
})
# this is a new df with only a subset of the levels of the original factors
df_new <- data.frame(x1 = c("B", "C"),
                     x2 = rnorm(2, 100, 5),
                     x3 = c("U", "U"))
# passing xlev builds out a model.matrix with the same levels as the original df
mm_new <- model.matrix(~ ., data = df_new[1, ], xlev = xlevs)
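Since the question is about a production pipeline, one possible way to carry the levels over is to save the list at training time and reload it at prediction time. The following is only a sketch under assumptions: the toy data, the names train_toy, new_toy, and lev_path, and the use of saveRDS()/readRDS() with a temporary file are my choices for illustration, not part of the example above. The new data is built with stringsAsFactors = TRUE so the columns are factors regardless of R version.

```r
# Training time: build the matrix and record the factor levels
# (train_toy is a hypothetical, deterministic stand-in for the training df)
train_toy <- data.frame(
  x1 = factor(rep(LETTERS[1:5], times = 4)),
  x2 = seq(90, 109, length.out = 20),
  x3 = factor(rep(c("U", "L"), times = 10)),
  y  = seq(8, 12, length.out = 20)
)
mm_train <- model.matrix(y ~ ., data = train_toy)

is_fac <- vapply(train_toy, is.factor, logical(1))
xlevs_toy <- lapply(train_toy[, is_fac, drop = FALSE], levels)

# Persist only the levels, not the training data (path is an assumption)
lev_path <- tempfile(fileext = ".rds")
saveRDS(xlevs_toy, lev_path)

# Prediction time: reload the levels and apply them to a single new row
xlevs_loaded <- readRDS(lev_path)
new_toy <- data.frame(x1 = "B", x2 = 98, x3 = "U", stringsAsFactors = TRUE)
mm_pred <- model.matrix(~ ., data = new_toy, xlev = xlevs_loaded)

# The new matrix should carry the same feature names as the training matrix
same_cols <- identical(colnames(mm_pred), colnames(mm_train))
print(same_cols)
```

Only the small list of levels needs to be stored alongside the fitted model, so the training dataframe never has to be reloaded.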
Note that this solution only handles factor levels that are a subset of the original factor levels. It isn't intended to handle new factor levels.
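To illustrate that limitation, here is a small sketch of what I'd expect when a new observation carries a level that is not in xlev. In the R versions I'm familiar with, model.frame stops with an error along the lines of "factor x1 has new level Z"; the exact wording may differ by version or locale. The names xlevs_toy and unseen are hypothetical.

```r
# Levels recorded at training time (hypothetical toy values)
xlevs_toy <- list(x1 = c("A", "B", "C"), x3 = c("L", "U"))

# A new observation whose x1 value "Z" was never seen in training
unseen <- data.frame(x1 = "Z", x2 = 100, x3 = "U", stringsAsFactors = TRUE)

# Capture the error message instead of letting the call fail
result <- tryCatch(
  model.matrix(~ ., data = unseen, xlev = xlevs_toy),
  error = function(e) conditionMessage(e)
)
print(result)
```

In a prediction pipeline it may be worth catching this case explicitly and deciding how unseen levels should be treated before the matrix is built.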