I am trying to create a model matrix from a formula that has many interaction terms (some continuous, some 0-1, some factors with many levels). Creating this model matrix is the bottleneck of my script. The final model matrix has 8M rows and 1000 columns. Because the factors with many levels are 0-1 encoded, the matrix representing the interactions is very sparse, so I already use sparse.model.matrix.
Is there a faster way to generate this matrix? Perhaps in Rcpp?
Have you considered using caret's dummyVars? It works for me and seems reasonably fast.
?dummyVars compares the default behavior of model.matrix and dummyVars, but doesn't say much about it.
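To make that comparison concrete, here is a minimal sketch on a toy data frame. The data frame d and its columns f and x are made up for illustration; the only assumption is that caret is installed. As far as I can tell, model.matrix adds an intercept and drops the first factor level, while dummyVars by default keeps one column per level and no intercept.

```r
library(caret)

# Hypothetical toy data: one 3-level factor and one numeric column
d <- data.frame(
  f = factor(c("a", "b", "c", "a"), levels = c("a", "b", "c")),
  x = c(1.5, 2.0, 0.5, 3.0)
)

# Base R: intercept + 2 dummy columns (first level dropped) + x
mm_base <- model.matrix(~ f + x, d)

# caret default: one dummy column per level + x, no intercept
mm_all <- predict(dummyVars(~ f + x, data = d), newdata = d)

# caret with fullRank = TRUE: first level dropped, still no intercept
mm_full <- predict(dummyVars(~ f + x, data = d, fullRank = TRUE), newdata = d)

colnames(mm_base)
colnames(mm_all)
colnames(mm_full)
```

Note that dummyVars itself only returns an encoder object; the matrix comes from calling predict on it.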
For a small performance benchmark on a reproducible example:
n <- 1e3 # observations
m <- 1e2 # variables
some_levels <- sort(c(LETTERS, letters))
library('microbenchmark')
set.seed(1234)
df <- data.frame(
  lapply(1:m, function(x) {
    switch(sample.int(3, 1),
      # "some continuous, some 0-1"
      '1' = rnorm(n), '2' = rbinom(n, 1, 0.5),
      # "some factors with many levels"
      '3' = factor(sample(some_levels, n, TRUE),
                   levels = some_levels)
    )
  })
)
names(df) <- paste0('V', 1:m)
#------------- it sounds like you are doing something like this --------------
frm <- as.formula(paste('~', paste(names(df), collapse = '+')))
library('Matrix')
microbenchmark(
  mm <- sparse.model.matrix(frm, df)
) # mean = .133 sec (YMMV)
#---------------- you could try something like this --------------------------
library('caret')
microbenchmark(
  dv <- dummyVars(frm, df, fullRank = TRUE)
) # mean = .00954 sec (YMMV)
# Caveat: dummyVars() only builds the encoder object. The matrix itself comes
# from predict(), which returns a dense matrix, so benchmark this step as well
# before comparing against sparse.model.matrix.
mm2 <- predict(dv, newdata = df)
Note fullRank = TRUE, so that "factors are encoded to be consistent with model.matrix and the resulting there [sic] are no linear dependencies induced between the columns," per ?dummyVars. You might want to remove fullRank = TRUE to induce the behavior of sparse=TRUE in contr.ltfr, as in sparse.model.matrix. I could not find clear documentation.
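If the rest of your script expects a sparse matrix, one possible workflow is sketched below. The toy data frame and the names enc, dense and sp are made up for illustration, and converting through a dense intermediate is my assumption about how to get there, not something the caret documentation prescribes.

```r
library(Matrix)
library(caret)

set.seed(1)
# Hypothetical toy data: one numeric column and one 3-level factor
toy <- data.frame(
  x = rnorm(6),
  g = factor(sample(c("a", "b", "c"), 6, TRUE), levels = c("a", "b", "c"))
)

# Step 1: build the encoder (no matrix is created yet)
enc <- dummyVars(~ x + g, data = toy, fullRank = TRUE)

# Step 2: predict() creates the dense model matrix
dense <- predict(enc, newdata = toy)

# Step 3: convert to a sparse representation for downstream code
sp <- Matrix(dense, sparse = TRUE)

dim(sp)
```

Keep in mind that the intermediate matrix in step 2 is dense. At 8M rows by 1000 columns of doubles that is roughly 64 GB, so for the full data you would probably have to run steps 2 and 3 on chunks of rows and rbind the sparse pieces.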