Generating Indicators in large data frames

Question

The objective is to create indicators for a factor/string variable in a data frame. The data frame has more than 2 million rows, and because I'm running R on Windows, I don't have the option of using plyr with .parallel=T. So I'm taking the "divide and conquer" route with plyr and reshape2.

Running melt and cast runs out of memory, and using

ddply( idata.frame(items) , c("ID") , function(x){
       (    colSums( model.matrix( ~ x$element - 1) ) > 0   )
} , .progress="text" )

or

ddply( idata.frame(items) , c("ID") , function(x){
           (    elements %in% x$element   )
    } , .progress="text" )

takes a while. The %in% version runs faster than the model.matrix call, but the fastest approach I have found is the call to tapply below. Do you see a way to speed this up? Thanks.

set.seed(123)

dd <- data.frame(
  id  = sample( 1:5, size=10 , replace=TRUE ) ,
  prd = letters[sample( 1:5, size=10 , replace=TRUE )]
  )

prds <- unique(dd$prd)

tapply( dd$prd , dd$id , function(x) prds %in% x )
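The tapply call above returns a list-like array with one logical vector per id, not a matrix. A minimal sketch of stacking that result into an indicator matrix, using only base R (the names res and ind are introduced here for illustration and are not part of the question):

```r
set.seed(123)

dd <- data.frame(
  id  = sample(1:5, size = 10, replace = TRUE),
  prd = letters[sample(1:5, size = 10, replace = TRUE)]
)

prds <- unique(dd$prd)

# One logical vector per id, each of length(prds)
res <- tapply(dd$prd, dd$id, function(x) prds %in% x)

# Stack the per-id vectors: rows are ids, columns are products
ind <- do.call(rbind, res)
colnames(ind) <- as.character(prds)

ind
```

The columns follow the order of prds (order of first appearance), so reorder them with ind[, sort(colnames(ind))] if an alphabetical layout is preferred.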

BenBarnes · Accepted Answer

For this problem, the packages bigmemory and bigtabulate might be your friends. Here is a slightly more ambitious example:

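library(bigmemory)
library(bigtabulate)
library(rbenchmark)  # benchmark() comes from rbenchmark; its output columns match the results below

set.seed(123)

# 2 million rows, 15 ids and 15 products
dd <- data.frame(
  id = sample( 1:15, size=2e6 , replace=TRUE ),
  prd = letters[sample( 1:15, size=2e6 , replace=TRUE )]
  )

prds <- unique(dd$prd)

# Each expression is timed over the default 100 replications
benchmark(
bigtable(dd,c(1,2))>0,
table(dd[,1],dd[,2])>0,
xtabs(~id+prd,data=dd)>0,
tapply( dd$prd , dd$id , function(x) prds %in% x )
)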

And the results of benchmarking (I'm learning new things all the time):

                                            test replications elapsed relative user.self sys.self user.child sys.child
1                      bigtable(dd, c(1, 2)) > 0          100  54.401 1.000000    51.759    3.817          0         0
2                    table(dd[, 1], dd[, 2]) > 0          100 112.361 2.065422   107.526    6.614          0         0
4 tapply(dd$prd, dd$id, function(x) prds %in% x)          100 178.308 3.277660   166.544   13.275          0         0
3                xtabs(~id + prd, data = dd) > 0          100 229.435 4.217478   217.014   16.660          0         0

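That shows bigtable winning by a considerable margin: over 100 replications it is roughly 2 times faster than table, 3.3 times faster than tapply, and 4.2 times faster than xtabs. With 2 million rows spread over only 15 ids and 15 products, the results are pretty much that all prds are in all IDs, but see ?bigtable for details on the format of the results.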
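To see what the indicator output looks like when not every cell is TRUE, here is a small sketch that applies the base table approach from the benchmark to the 10-row example from the question. It uses base R only, and the name ind is introduced here for illustration:

```r
set.seed(123)

dd <- data.frame(
  id  = sample(1:5, size = 10, replace = TRUE),
  prd = letters[sample(1:5, size = 10, replace = TRUE)]
)

# Cross-tabulate id by prd, then turn counts into TRUE/FALSE indicators
ind <- table(dd$id, dd$prd) > 0

ind

# Number of distinct products seen for each id
rowSums(ind)
```

Rows are ids and columns are products, both in sorted order. The bigtable call in the benchmark tabulates the same two columns, but its result format may differ, as noted above.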

Tags: memory, dataframe, r, plyr, reshape2

M.Dimo
