Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Should data.table's CJ continue accommodating arguments with duplicate elements?

Tags:

r

data.table

(EDIT: @Arun cleaned and fixed up the code below, and has replaced 'CJ' in data.table on R-Forge. Feature request: #4849; Faster CJ and update on Why is expand.grid faster than data.table 's CJ?)

In my mind CJ is meant to work on argument vectors that satisfy anyDuplicated(vector) == F.

Does anyone use it with non-unique arguments?

If so, is it worth leveling a 100x speed hit on what I consider to be the primary use?

Lower bound of speed improvement by tuning for primary use (lower bound for these sample arguments, not lower bound for any arguments):

Comparison:

Unit: milliseconds
                  expr        min         lq     median         uq      max neval
    dt1 <- CJ(a, b, c) 3149.15293 3166.59638 3204.95956 3472.70826 4414.919   100
dt2 <- fastCJ(a, b, c)   22.85207   23.16003   23.43691   24.04215 1208.855   100
Output identical: TRUE

Code:

library(microbenchmark)
library(data.table)

repTE <- function(x, times, each) {
  rep.int(rep.int(x, times=rep.int(each, times=length(x))), times=times)
}
fastCJ <- function(...) {
  arg_list <- list(...)
  l <- lapply(arg_list, sort.int, method="quick")
  seq_ct <- length(l)
  if (seq_ct > 1) {
    seq_lens <- vapply(l, length, numeric(1))
    tot_len <- prod(seq_lens)

    l <-lapply(
      seq_len(seq_ct),
      function(i) {
        if (i==1) {
          len <- seq_lens[1]
          rep.int(l[[1]], times=rep.int(tot_len/len, len))
        } else if (i < seq_ct) {
          pre_len <- prod(seq_lens[1:(i - 1)])
          repTE(l[[i]], times=pre_len, each=tot_len/pre_len/seq_lens[i])
        } else {
          rep.int(l[[seq_ct]], times=tot_len/seq_lens[seq_ct])
        }
      }
    )    
  } else {
    tot_len <- length(l[[1]])
  }

  setattr(l, "row.names", .set_row_names(tot_len))
  setattr(l, "class", c("data.table", "data.frame"))
  if (is.null(names <- names(arg_list))) {
    names <- vector("character", seq_ct)
  }
  if (any(tt <- names == "")) {
    names[tt] <- paste0("V", which(tt))
  }
  setattr(l, "names", names)
  data.table:::settruelength(l, 0L)
  l <- alloc.col(l)
  setattr(l, "sorted", names(l))

  return(l)
}

a <- factor(sample(1:1000, 1000))
b <- sample(letters, 26)
c <- runif(100)
print(microbenchmark( dt1 <- CJ(a, b, c), dt2 <- fastCJ(a, b, c)))
cat("Output identical:", identical(dt1, dt2))
like image 983
garborg Avatar asked Nov 13 '22 00:11

garborg


1 Answers

For completeness, to add an answer, from NEWS :

CJ() is 90% faster on 1e6 rows (for example), #4849. The inputs are now sorted first before combining rather than after combining and uses rep.int instead of rep (thanks to Sean Garborg for the ideas, code and benchmark) and only sorted if is.unsorted(), #2321.

Reminder: CJ = Cross Join; i.e., joins to all combinations of its inputs.

like image 72
Matt Dowle Avatar answered Nov 15 '22 07:11

Matt Dowle