> system.time(expand.grid(1:1000,1:10000))
user system elapsed
1.65 0.34 2.03
> system.time(CJ(1:1000,1:10000))
user system elapsed
3.48 0.32 3.79
Thanks for reporting this. This has been fixed now in data.table 1.8.9. Here's the timing test with the latest commit (913):
system.time(expand.grid(1:1000,1:10000))
# user system elapsed
# 1.420 0.552 1.987
system.time(CJ(1:1000,1:10000))
# user system elapsed
# 0.080 0.092 0.171
From NEWS :
CJ() is 90% faster on 1e6 rows (for example), #4849. The inputs are now sorted first before combining rather than after combining and uses rep.int instead of rep (thanks to Sean Garborg for the ideas, code and benchmark) and only sorted if is.unsorted(), #2321.
Also check out NEWS for other notable features that have made it in and bug fixes; e.g., CJ()
gains a new sorted
argument too.
Mnel's guess is right. CJ returns a data.table where each column is a key.
> DT <- CJ(1:100,1:100)
> key(DT)
[1] "V1" "V2"
A fairer comparison:
> system.time(CJ(1:1000,1:10000))
user system elapsed
3.40 0.25 3.73
> system.time(data.table(expand.grid(1:1000,1:10000),key=c("Var1","Var2")))
user system elapsed
4.14 0.68 4.90
The real reason it's slow is that it's not focused on the default use of CJ: when the argument vectors satisfy anyDuplicated(vector) == F
.
Maybe others use CJ differently, but for unique vectors, there's room for improvement:
Unit: milliseconds
expr min lq median uq max neval
dt1 <- CJ(a, b, c) 2394.38929 2434.75660 2439.14362 2444.66607 2686.41990 100
dt2 <- fastCJ(a, b, c) 18.83701 25.33339 25.51254 25.70966 27.60622 100
Output identical: TRUE
library(microbenchmark)
library(data.table)
repTE <- function(x, times, each) {
rep.int(rep.int(x, times=rep.int(each, times=length(x))), times=times)
}
fastCJ <- function(...) {
l <- lapply(list(...), sort.int, method="quick")
seq_ct <- length(l)
if (seq_ct > 1) {
seq_lens <- vapply(l, length, numeric(1))
tot_len <- prod(seq_lens)
l <-lapply(
seq_len(seq_ct),
function(i) {
if (i==1) {
len <- seq_lens[1]
rep.int(l[[1]], times=rep.int(tot_len/len, len))
} else if (i < seq_ct) {
pre_len <- prod(seq_lens[1:(i - 1)])
repTE(l[[i]], times=pre_len, each=tot_len/pre_len/seq_lens[i])
} else {
rep.int(l[[seq_ct]], times=tot_len/seq_lens[seq_ct])
}
}
)
} else {
tot_len <- length(l[[1]])
}
setattr(l, "row.names", .set_row_names(tot_len))
setattr(l, "class", c("data.table", "data.frame"))
if (is.null(names <- names(seq_list))) {
names <- vector("character", seq_ct)
}
if (any(tt <- names == "")) {
names[tt] <- paste0("V", which(tt))
}
setattr(l, "names", names)
data.table:::settruelength(l, 0L)
l <- alloc.col(l)
setattr(l, "sorted", names(l))
return(l)
}
a <- factor(sample(1:1000, 1000))
b <- sample(letters, 26)
c <- runif(100)
print(microbenchmark( dt1 <- CJ(a, b, c), dt2 <- fastCJ(a, b, c)))
cat("Output identical:", identical(dt1, dt2))
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With