
Generating Indicators in large data frames

The objective is to create indicators for a factor/string variable in a data frame. The data frame has more than 2 million rows, and because I'm running R on Windows, I don't have the option of using plyr with .parallel=TRUE. So I'm taking the "divide and conquer" route with plyr and reshape2.

Running melt and cast runs out of memory, and using

ddply( idata.frame(items) , c("ID") , function(x){
       (    colSums( model.matrix( ~ x$element - 1) ) > 0   )
} , .progress="text" )    

or

ddply( idata.frame(items) , c("ID") , function(x){
           (    elements %in% x$element   )
    } , .progress="text" )  

takes a while. The %in% version runs faster than the model.matrix call, but the fastest approach I have found so far is the call to tapply below. Do you see a way to speed this up? Thanks.

set.seed(123)

dd <- data.frame(
  id  = sample( 1:5, size=10 , replace=T ) ,
  prd = letters[sample( 1:5, size=10 , replace=T )]
  )

prds <- unique(dd$prd)

tapply( dd$prd , dd$id , function(x) prds %in% x )
M.Dimo asked Mar 26 '12 19:03


1 Answer

For this problem, the packages bigmemory and bigtabulate might be your friends. Here is a slightly more ambitious example:

library(bigmemory)
library(bigtabulate)
library(rbenchmark)  # provides benchmark(), used below

set.seed(123)

dd <- data.frame(
  id = sample( 1:15, size=2e6 , replace=T ), 
  prd = letters[sample( 1:15, size=2e6 , replace=T )]
  )

prds <- unique(dd$prd)

benchmark(
bigtable(dd,c(1,2))>0,
table(dd[,1],dd[,2])>0,
xtabs(~id+prd,data=dd)>0,
tapply( dd$prd , dd$id , function(x) prds %in% x )
)

And the results of benchmarking (I'm learning new things all the time):

                                            test replications elapsed relative user.self sys.self user.child sys.child
1                      bigtable(dd, c(1, 2)) > 0          100  54.401 1.000000    51.759    3.817          0         0
2                    table(dd[, 1], dd[, 2]) > 0          100 112.361 2.065422   107.526    6.614          0         0
4 tapply(dd$prd, dd$id, function(x) prds %in% x)          100 178.308 3.277660   166.544   13.275          0         0
3                xtabs(~id + prd, data = dd) > 0          100 229.435 4.217478   217.014   16.660          0         0

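That shows bigtable winning by a considerable margin: going by the relative column, it is roughly twice as fast as table, more than three times as fast as tapply, and more than four times as fast as xtabs. With this much random data, the result is pretty much that all prds are in all IDs, but see ?bigtable for details on the format of the results.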
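If you want to check that the cross-tabulation approach gives the same indicators as the tapply call, here is a minimal base-R sketch on the small data from the question. It does not use bigtabulate, and the names ind_table and ind_tapply are mine, introduced only for illustration. It assumes you want one row per id and one column per prd.

```r
set.seed(123)

dd <- data.frame(
  id  = sample(1:5, size = 10, replace = TRUE),
  prd = letters[sample(1:5, size = 10, replace = TRUE)]
)

# Sorted product labels, as character, so they line up with table()'s columns
prds <- sort(unique(as.character(dd$prd)))

# Approach 1: cross-tabulate once, then threshold the counts
ind_table <- table(dd$id, dd$prd) > 0

# Approach 2: the tapply call from the question, reshaped into a matrix
# (ind_tapply is a hypothetical name; tapply itself returns a list-like array)
ind_list <- tapply(dd$prd, dd$id, function(x) prds %in% x)
ind_tapply <- matrix(
  unlist(ind_list),
  nrow = length(ind_list),
  byrow = TRUE,
  dimnames = list(dimnames(ind_list)[[1]], prds)
)

# Both should mark the same (id, prd) pairs as present
all(ind_table == ind_tapply)
```

The point of the comparison is that table (and, by extension, bigtable) does the grouping in a single pass over the data, whereas tapply calls the anonymous function once per id.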

BenBarnes answered Oct 22 '22 09:10