Does the by() function build a growing list?

Question

Does the by function make a list that grows one element at a time?

I need to process a data frame with about 4M observations grouped by a factor column. The situation is similar to the example below:

> # Make 4M rows of data
> x = data.frame(col1=1:4000000, col2=10000001:14000000)
> # Make a grouping column (numeric here; not yet an R factor)
> x[,"f"] = x[,"col1"] - x[,"col1"] %% 5
>   
> head(x)
  col1     col2 f
1    1 10000001 0
2    2 10000002 0
3    3 10000003 0
4    4 10000004 0
5    5 10000005 5
6    6 10000006 5

Now, a tapply on one of the columns takes a reasonable amount of time:

> t1 = Sys.time()
> z = tapply(x[, 1], x[, "f"], mean)
> Sys.time() - t1
Time difference of 22.14491 secs

But if I do this:

z = by(x[, 1], x[, "f"], mean)

That doesn't finish in anywhere near the same time (I gave up after a minute).

Of course, in the above example tapply could be used, but I actually need to process multiple columns together. What is a better way to do this?

Ricardo Saporta · Accepted Answer

by is slower than tapply because it is a wrapper around tapply. Let's take a look at some benchmarks: in this situation tapply is more than 3x faster than by.

UPDATED to include @Roland's great recommendation:

library(rbenchmark)
library(data.table)
dt <- data.table(x,key="f")

using.tapply <- quote(tapply(x[, 1], x[, "f"], mean))
using.by <- quote(by(x[, 1], x[, "f"], mean))
using.dtable <- quote(dt[,mean(col1),by=key(dt)])

times <- benchmark(using.tapply, using.dtable, using.by, replications=10, order="relative")
times[,c("test", "elapsed", "relative")] 

#------------------------#
#         RESULTS        # 
#------------------------#

#       COMPARING tapply VS by     #
#-----------------------------------
#              test elapsed relative
#   1  using.tapply   2.453    1.000
#   2      using.by   8.889    3.624

#   COMPARING data.table VS tapply VS by   #
#------------------------------------------#
#             test elapsed relative
#   2  using.dtable   0.168    1.000
#   1  using.tapply   2.396   14.262
#   3      using.by   8.566   50.988

If x$f is a factor, the loss in efficiency between tapply and by is even greater!

Note, though, that both improve relative to the non-factor input, while data.table stays approximately the same or gets slightly worse:

x[, "f"] <- as.factor(x[, "f"])
dt <- data.table(x,key="f")
times <- benchmark(using.tapply, using.dtable, using.by, replications=10, order="relative")
times[,c("test", "elapsed", "relative")] 

#               test elapsed relative
#   2   using.dtable   0.175    1.000
#   1   using.tapply   1.803   10.303
#   3       using.by   7.854   44.880
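One plausible contributor to the factor speed-up is that tapply coerces its grouping vector with as.factor on every call, so a pre-built factor skips that step. The following is only a minimal sketch under assumptions (a reduced row count of 100,000 and the hypothetical names f.num and f.fac); it checks that both forms give the same group means and does not attempt to reproduce the timings above.

```r
# Assumption: 100,000 rows instead of the 4M used in the question
n <- 100000
col1 <- 1:n

# Grouping values as plain numbers, and the same values pre-converted to a factor
f.num <- col1 - col1 %% 5
f.fac <- as.factor(f.num)

# tapply accepts either form; with f.num it has to build the factor itself
res.num <- tapply(col1, f.num, mean)
res.fac <- tapply(col1, f.fac, mean)

# The group means agree, so converting once up front changes only the cost
print(all.equal(res.num, res.fac))
```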

As for the why, the short answer is in the documentation itself.

?by :

Description

Function by is an object-oriented wrapper for tapply applied to data frames.

Let's take a look at the source for by (or more specifically, by.data.frame):

by.data.frame
function (data, INDICES, FUN, ..., simplify = TRUE) 
{
    if (!is.list(INDICES)) {
        IND <- vector("list", 1L)
        IND[[1L]] <- INDICES
        names(IND) <- deparse(substitute(INDICES))[1L]
    }
    else IND <- INDICES
    FUNx <- function(x) FUN(data[x, , drop = FALSE], ...)
    nd <- nrow(data)
    ans <- eval(substitute(tapply(seq_len(nd), IND, FUNx, simplify = simplify)), 
        data)
    attr(ans, "call") <- match.call()
    class(ans) <- "by"
    ans
}

We see immediately that there is still a call to tapply, plus a lot of extras: calls to deparse(substitute(.)) and eval(substitute(.)), both of which are relatively slow, and the FUNx wrapper, which subsets the data frame with data[x, , drop = FALSE] once for every group. It therefore makes sense that a plain tapply call is faster than a similar call to by.
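The overhead described above can be observed directly on a smaller data set. This is a rough sketch rather than a rigorous benchmark: the row count of 100,000 and the anonymous function passed to by are assumptions made for illustration, and the timings will vary by machine.

```r
# Assumption: 100,000 rows instead of the question's 4M, so this runs in seconds
n <- 100000
x <- data.frame(col1 = 1:n, col2 = (1:n) + 10000000L)
x$f <- x$col1 - x$col1 %% 5

# tapply splits the plain vector directly
time.tapply <- system.time(res.tapply <- tapply(x$col1, x$f, mean))

# by hands FUN a data frame holding one group's rows, built by subsetting x per group
time.by <- system.time(res.by <- by(x, x$f, function(d) mean(d$col1)))

# Same group means either way; only the elapsed time differs
stopifnot(isTRUE(all.equal(as.numeric(res.tapply), as.numeric(res.by))))
print(c(tapply = time.tapply[["elapsed"]], by = time.by[["elapsed"]]))
```

Because by passes a whole data frame to the function, it remains the more convenient choice when several columns are needed at once and the number of groups is small.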

Roland · Answer

Regarding a better way to do this: With 4M rows you should use data.table.

library(data.table)
dt <- data.table(x,key="f")
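# Mean of one column per group
dt[,mean(col1),by=key(dt)]

# Named means of several columns, or the mean of every non-grouping column via .SD
dt[,list(mean1=mean(col1),mean2=mean(col2)),by=key(dt)]
dt[,lapply(.SD,mean),by=key(dt)]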
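When only some columns should be summarised together, .SDcols restricts what .SD contains. Here is a small sketch on toy data (the ten-row table is an assumption for illustration, not the data from the question):

```r
library(data.table)

# Toy data: two groups of five rows each, keyed by the grouping column f
dt <- data.table(col1 = 1:10, col2 = 11:20, f = rep(c(0, 5), each = 5), key = "f")

# .SD holds the columns named in .SDcols for the current group;
# lapply applies mean to each of them and returns one row per group
res <- dt[, lapply(.SD, mean), by = f, .SDcols = c("col1", "col2")]
print(res)
```

The result has one row per value of f, with columns f, col1 and col2.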

Tags: r, benchmarking, tapply

Anand
