
How to save mapping of data.frame-to-model.matrix and apply to new observations?

Some modeling functions, e.g. glmnet(), require (or at least allow) the data to be passed in as a predictor matrix and a response matrix (or vector) as opposed to using a formula. In these cases, the predict() method, e.g. predict.glmnet(), typically requires that the newdata argument provide a predictor matrix with the same features that were used to train the model.

A convenient way to create a predictor matrix when your dataframe has factors (R's categorical data type) is to use the model.matrix() function, which automatically creates dummy features for your categorical variables:

# this is the dataframe and matrix I want to use to train the model
set.seed(1)
df <- data.frame(x1 = factor(sample(LETTERS[1:5], 20, replace = TRUE)),
                 x2 = rnorm(20, 100, 5),
                 x3 = factor(sample(c("U","L"), 20, replace = TRUE)),
                 y = rnorm(20, 10, 2))

mm <- model.matrix(y~., data = df)

But when I introduce a dataframe with new observations that contain only a subset of the levels of the factors from the original dataframe, model.matrix() (predictably) returns a matrix with different dummy features. This new matrix cannot be used in predict.glmnet() because it doesn't have the same features that the model is expecting:

# this is the dataframe and matrix I want to predict on
set.seed(1)
df_new <- data.frame(x1 = c("B", "C"),
                     x2 = rnorm(2, 100, 5),
                     x3 = c("L","U"))

mm_new <- model.matrix(~., data = df_new)
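
The mismatch can be confirmed by comparing column names. The following is a minimal sketch on hypothetical toy data with explicitly chosen levels (the names train, new, mm_train and mm_naive are made up for illustration and are not the objects above), so the result does not depend on random sampling:

```r
# Hypothetical training data: x1 has five levels, x3 has two
train <- data.frame(x1 = factor(c("A", "B", "C", "D", "E")),
                    x2 = c(1.5, 2.0, 2.5, 3.0, 3.5),
                    x3 = factor(c("U", "L", "U", "L", "U")))

# Hypothetical new observations: only two of the five x1 levels appear
new <- data.frame(x1 = c("B", "C"),
                  x2 = c(2.2, 2.8),
                  x3 = c("L", "U"))

mm_train <- model.matrix(~ ., data = train)
mm_naive <- model.matrix(~ ., data = new)

# Dummy columns present in the training matrix but absent from the new one
missing_cols <- setdiff(colnames(mm_train), colnames(mm_naive))
print(missing_cols)
```

Any non-empty result here means the new matrix cannot be passed to a predict() method that expects the training columns.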

Is there a way to save the transformation (creating all necessary dummy features) from a dataframe to a model matrix so that I can re-apply this transformation to future observations? In my above example, this would ideally result in mm_new having identical feature names as mm so that predict() can accept mm_new.

I want to add that I'm aware of this approach, which essentially suggests including the observations from df_new in df before calling model.matrix(). This works fine if I have all the observations to begin with and I'm just training and testing models. However, the new observations will only be accessible in the future (in a production prediction pipeline), and I want to avoid the overhead of re-loading the entire training dataframe for new predictions.

Asked by ishak on Feb 05 '23 at 14:02


1 Answer

I found exactly what I needed available in the documentation for model.matrix and model.frame, and wanted to share. There is an argument in model.matrix called xlev which is "to be used as argument of model.frame if data is such that model.frame is called."

If model.matrix calls model.frame, xlev expects a named list of character vectors, one per factor in the dataframe (with each list element named after its factor). Each character vector contains the full set of factor levels needed to build the new model.matrix with the same dummy features as the original model.matrix.

Here's a working example:

set.seed(1)
df <- data.frame(x1 = factor(sample(LETTERS[1:5], 20, replace = TRUE)),
                 x2 = rnorm(20, 100, 5),
                 x3 = factor(sample(c("U","L"), 20, replace = TRUE)),
                 y = rnorm(20, 10, 2))

mm <- model.matrix(y~., data = df)

# named list holding the levels of each factor column in the original df
xlevs <- lapply(df[, sapply(df, is.factor), drop = FALSE], levels)

# this is a new df with only a subset of the levels of the original factors
df_new <- data.frame(x1 = c("B", "C"),
                     x2 = rnorm(2, 100, 5),
                     x3 = c("U","U"))

# passing "xlev = " builds a model.matrix with the same dummy columns as the original df
# (only the first row of df_new is used here; pass df_new to get all rows)
mm_new <- model.matrix(~., data = df_new[1,], xlev = xlevs)
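
For the production-pipeline case in the question, one option is to store only the list of levels and reload it at prediction time, so the training dataframe never has to be loaded again. This is a minimal sketch under stated assumptions: the toy data, the object names, and the file location in tempdir() are all hypothetical, and saveRDS()/readRDS() is just one possible way to persist the list.

```r
# Hypothetical training data with explicit factor levels
train <- data.frame(x1 = factor(c("A", "B", "C", "D", "E")),
                    x2 = c(1.5, 2.0, 2.5, 3.0, 3.5),
                    x3 = factor(c("U", "L", "U", "L", "U")),
                    y  = c(10, 11, 9, 12, 10))
mm_train <- model.matrix(y ~ ., data = train)

# Collect the levels of every factor column and persist them
xlevs <- lapply(Filter(is.factor, train), levels)
path <- file.path(tempdir(), "xlevs.rds")  # hypothetical storage location
saveRDS(xlevs, path)

# Later, in the prediction process: reload the levels only
xlevs_loaded <- readRDS(path)
new <- data.frame(x1 = c("B", "C"),
                  x2 = c(2.2, 2.8),
                  x3 = c("U", "U"))
mm_new <- model.matrix(~ ., data = new, xlev = xlevs_loaded)

# The predictor columns should now line up with the training matrix
stopifnot(identical(colnames(mm_new), colnames(mm_train)))
```

The stopifnot() check at the end is a cheap guard worth keeping in a pipeline, since a column mismatch would otherwise only surface inside predict().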

Note that this solution only handles factor levels that are a subset of the original factor levels. It isn't intended to handle new factor levels.
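
To illustrate that limitation, here is a small sketch on hypothetical data (the level "Z" and all object names are made up). It checks for unseen levels up front with setdiff(), then shows that the model.matrix() call itself fails when such a level is present. How to treat unseen levels (drop the row, map to a baseline, etc.) is a modeling decision this sketch does not make.

```r
# Levels saved from a hypothetical training set
xlevs <- list(x1 = c("A", "B", "C"))

# New data containing a level ("Z") that was never seen in training
new <- data.frame(x1 = c("B", "Z"),
                  x2 = c(1, 2))

# Detect unseen levels before building the matrix
unseen <- setdiff(unique(new$x1), xlevs$x1)
print(unseen)

# Building the matrix anyway raises an error, captured here for inspection
result <- tryCatch(
  model.matrix(~ ., data = new, xlev = xlevs),
  error = function(e) e
)
print(inherits(result, "error"))
```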

Answered by ishak on Feb 07 '23 at 09:02