(EDIT: @Arun cleaned and fixed up the code below, and has replaced 'CJ' in data.table on R-Forge. Feature request: #4849; Faster CJ and update on Why is expand.grid faster than data.table 's CJ?)
In my mind CJ
is meant to work on argument vectors that satisfy anyDuplicated(vector) == F
.
Does anyone use it with non-unique arguments?
If so, is it worth leveling a 100x speed hit on what I consider to be the primary use?
Lower bound of speed improvement by tuning for primary use (lower bound for these sample arguments, not lower bound for any arguments):
Unit: milliseconds
expr min lq median uq max neval
dt1 <- CJ(a, b, c) 3149.15293 3166.59638 3204.95956 3472.70826 4414.919 100
dt2 <- fastCJ(a, b, c) 22.85207 23.16003 23.43691 24.04215 1208.855 100
Output identical: TRUE
library(microbenchmark)
library(data.table)
repTE <- function(x, times, each) {
rep.int(rep.int(x, times=rep.int(each, times=length(x))), times=times)
}
fastCJ <- function(...) {
arg_list <- list(...)
l <- lapply(arg_list, sort.int, method="quick")
seq_ct <- length(l)
if (seq_ct > 1) {
seq_lens <- vapply(l, length, numeric(1))
tot_len <- prod(seq_lens)
l <-lapply(
seq_len(seq_ct),
function(i) {
if (i==1) {
len <- seq_lens[1]
rep.int(l[[1]], times=rep.int(tot_len/len, len))
} else if (i < seq_ct) {
pre_len <- prod(seq_lens[1:(i - 1)])
repTE(l[[i]], times=pre_len, each=tot_len/pre_len/seq_lens[i])
} else {
rep.int(l[[seq_ct]], times=tot_len/seq_lens[seq_ct])
}
}
)
} else {
tot_len <- length(l[[1]])
}
setattr(l, "row.names", .set_row_names(tot_len))
setattr(l, "class", c("data.table", "data.frame"))
if (is.null(names <- names(arg_list))) {
names <- vector("character", seq_ct)
}
if (any(tt <- names == "")) {
names[tt] <- paste0("V", which(tt))
}
setattr(l, "names", names)
data.table:::settruelength(l, 0L)
l <- alloc.col(l)
setattr(l, "sorted", names(l))
return(l)
}
a <- factor(sample(1:1000, 1000))
b <- sample(letters, 26)
c <- runif(100)
print(microbenchmark( dt1 <- CJ(a, b, c), dt2 <- fastCJ(a, b, c)))
cat("Output identical:", identical(dt1, dt2))
For completeness, to add an answer, from NEWS :
CJ() is 90% faster on 1e6 rows (for example), #4849. The inputs are now sorted first before combining rather than after combining and uses rep.int instead of rep (thanks to Sean Garborg for the ideas, code and benchmark) and only sorted if is.unsorted(), #2321.
Reminder: CJ = Cross Join; i.e., joins to all combinations of its inputs.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With