R: Fast way to create a sparse model matrix

Tags:

r

I am trying to create a model matrix from a formula with many interaction terms (some continuous, some 0-1, some factors with many levels). Creating this model matrix is the bottleneck of my script. The final model matrix has 8M rows and 1000 columns. Since the factors with many levels are 0-1 encoded, the resulting matrix representing interactions is very sparse, so I already use sparse.model.matrix.

Is there a faster way to generate this matrix? Perhaps in Rcpp?

asked Jul 12 '15 23:07 by JCWong



1 Answer

Have you considered using caret's dummyVars? It works for me and seems reasonably fast.

?dummyVars compares the default behavior of model.matrix and dummyVars, but doesn't say much about it.
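
As far as I can tell, the main difference in defaults is the encoding of factors: model.matrix uses treatment contrasts (an intercept plus one column per non-reference level), while dummyVars keeps one column per level. Here is a minimal sketch of those two encodings using only base R; the toy data frame `d` is a made-up example, and `~ f - 1` is just a stand-in for the all-levels encoding, not a call into caret.

```r
# Toy data: one 3-level factor (hypothetical, not from the question)
d <- data.frame(f = factor(c("a", "b", "c", "a")))

# Default model.matrix: intercept + treatment contrasts (k - 1 dummy columns)
full_rank <- model.matrix(~ f, d)
colnames(full_rank)   # "(Intercept)" "fb" "fc"

# One column per level (less than full rank); dropping the intercept
# makes model.matrix keep all k levels of the first factor
all_levels <- model.matrix(~ f - 1, d)
colnames(all_levels)  # "fa" "fb" "fc"
```

In the all-levels encoding every row sums to 1 across the factor's columns, which is the linear dependency that a full-rank encoding avoids.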

For a small performance benchmark on a reproducible example:

n = 1e3 # observations
m = 1e2 # variables
some_levels <- sort(c(LETTERS, letters))
library('microbenchmark')
set.seed(1234)

df <- data.frame(
       lapply(1:m, function(x){
                    switch(sample.int(3,1),    
                           # "some continuous, some 0-1"
                           '1' = rnorm(n), '2' = rbinom(n, 1, 0.5),
                           # "some factors with many levels"       
                           '3' = factor(sample(some_levels, n, TRUE),
                                        levels=some_levels )
                          )
                        })
               )
names(df) <- paste0('V',1:m)

#------------- it sounds like you are doing something like this --------------
frm <- as.formula( paste('~', paste(names(df), collapse='+') ) )
library('Matrix')
microbenchmark(
  mm <- sparse.model.matrix(frm, df)
) # mean = .133 sec (YMMV)

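#---------------- you could try something like this --------------------------
library('caret')
microbenchmark(
  dv <- dummyVars(frm, df, fullRank=TRUE)
) # mean = .00954 sec (YMMV); times only construction of the encoder object
# dummyVars() returns an encoder, not a matrix; predict() builds the (dense) matrix
mm2 <- predict(dummyVars(frm, df, fullRank=TRUE), newdata = df)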

Note that I set fullRank = TRUE so that "factors are encoded to be consistent with model.matrix and the resulting there [sic] are no linear dependencies induced between the columns," per ?dummyVars. You might want to remove fullRank = TRUE to induce the behavior of sparse = TRUE in contr.ltfr, as in sparse.model.matrix; I could not find clear documentation on this.
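
One caveat worth checking before scaling this up: to my knowledge, calling predict() on a dummyVars object returns an ordinary dense matrix, so it would still need converting to get something comparable to sparse.model.matrix. The sketch below shows that conversion with the Matrix package. The toy matrix `dense` is a hypothetical stand-in built with model.matrix, not actual dummyVars output.

```r
library(Matrix)

# Toy dense 0-1 matrix standing in for a dense dummy-encoded result
d <- data.frame(f = factor(c("a", "b", "c", "a")))
dense <- model.matrix(~ f - 1, d)

# Convert to a compressed sparse column matrix
sp <- Matrix(dense, sparse = TRUE)
is(sp, "sparseMatrix")   # TRUE
nnzero(sp)               # 4 (one non-zero per row)

# Rough size in GB of a dense double matrix at the question's scale
# (8M rows x 1000 columns x 8 bytes per double)
dense_gb <- 8e6 * 1000 * 8 / 1e9
dense_gb                 # 64
```

At roughly 64 GB for the dense intermediate, the conversion route may not be practical for the full 8M-row data unless it is done in row chunks, so it is worth benchmarking the whole pipeline rather than the encoder construction alone.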

answered Sep 17 '22 15:09 by C8H10N4O2
