R: Fast way to create a sparse model matrix

Question

I am trying to create a model matrix with a formula that has many interaction terms (some continuous, some 0-1, some factors with many levels). The creation of this model matrix is the bottleneck of my script. In the end the model matrix is 8M rows with 1000 columns. Since the factors with many levels are 0-1 encoded the resulting matrix representing interactions is very sparse, so I already use sparse.model.matrix.

Is there a faster way to generate this matrix? Perhaps in Rcpp?

C8H10N4O2 · Accepted Answer

Have you considered using caret's dummyVars? It works for me and seems reasonably fast.

?dummyVars compares the default behavior of model.matrix and dummyVars, but doesn't say much about it.
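To make that difference in defaults concrete, here is a minimal sketch on a toy factor. The data frame `toy` and the variable names are made up for illustration; only `model.matrix`, `dummyVars`, and `predict` come from base R and caret.

```r
library(caret)

# Hypothetical toy data: one factor with three levels
toy <- data.frame(f = factor(c("a", "b", "c", "a")))

# Base R default: an intercept column plus treatment contrasts,
# so the first level ("a") gets no column of its own
mm_base <- model.matrix(~ f, toy)

# caret default (fullRank = FALSE): no intercept, one indicator per level
dv_default <- predict(dummyVars(~ f, data = toy), newdata = toy)

# caret with fullRank = TRUE: the first level is dropped
dv_full <- predict(dummyVars(~ f, data = toy, fullRank = TRUE), newdata = toy)

colnames(mm_base)
colnames(dv_default)
colnames(dv_full)
```

Printing the column names side by side is usually enough to see which encoding a downstream model will receive.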

For a small performance benchmark on a reproducible example:

n <- 1e3 # observations
m <- 1e2 # variables
some_levels <- sort(c(LETTERS, letters))
library('microbenchmark')
set.seed(1234)

# Each column is randomly one of: continuous, 0-1, or a many-level factor
df <- data.frame(
  lapply(1:m, function(x) {
    switch(sample.int(3, 1),
           # "some continuous, some 0-1"
           '1' = rnorm(n), '2' = rbinom(n, 1, 0.5),
           # "some factors with many levels"
           '3' = factor(sample(some_levels, n, TRUE),
                        levels = some_levels))
  })
)
names(df) <- paste0('V', 1:m)

#------------- it sounds like you are doing something like this --------------
frm <- as.formula( paste('~', paste(names(df), collapse='+') ) )
library('Matrix')
microbenchmark(
  mm <- sparse.model.matrix(frm, df)
) # mean = .133 sec (YMMV)

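#---------------- you could try something like this --------------------------
library('caret')
# Note: dummyVars() only builds the encoding object; the matrix itself comes
# from predict(mm2, newdata = df), so this timing excludes the encoding step.
microbenchmark(
  mm2 <- dummyVars(frm, df, fullRank=TRUE)
) # mean = .00954 sec (YMMV)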

Note that I set fullRank = TRUE so that "factors are encoded to be consistent with model.matrix and the resulting there [sic] are no linear dependencies induced between the columns," per ?dummyVars. You might want to drop fullRank = TRUE, in which case dummyVars falls back to the less-than-full-rank encoding of contr.ltfr (which has a sparse argument), to get closer to what sparse.model.matrix does. I could not find clear documentation on this.
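Since dummyVars only returns an encoder, the matrix has to be produced with predict and then converted if a sparse result is needed. The following is a minimal sketch on hypothetical toy data (the names `toy`, `enc`, `dense_mm`, and `sparse_mm` are illustrative); it assumes the dense intermediate fits in memory, which may not hold at the 8M x 1000 scale in the question.

```r
library(caret)
library(Matrix)

set.seed(1)
# Hypothetical toy data: one continuous column and one three-level factor
toy <- data.frame(
  x = rnorm(6),
  f = factor(sample(c("a", "b", "c"), 6, TRUE), levels = c("a", "b", "c"))
)

# Step 1: build the encoder (this is the cheap part)
enc <- dummyVars(~ x + f, data = toy, fullRank = TRUE)

# Step 2: apply it to the data; predict() returns an ordinary dense matrix
dense_mm <- predict(enc, newdata = toy)

# Step 3: convert to a sparse matrix from the Matrix package
sparse_mm <- Matrix(dense_mm, sparse = TRUE)

dim(dense_mm)
class(sparse_mm)
```

For a fair comparison with sparse.model.matrix, it is probably worth timing steps 1 to 3 together rather than step 1 alone.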

Tags:

r

JCWong
