So, I have many data.tables I wish to combine into a single data.table with no duplicate rows.
The 'naive' way to do this is to wrap an rbind call in unique: unique(do.call(rbind, list.of.tables))
This certainly works, but it's pretty slow. In my real-world case the tables have two columns: a hash string and a size. At this point in the code they are un-keyed. I have played around with keying by hash first, but the time saved combining is offset by the time spent keying.
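For concreteness, here's that naive call on a couple of made-up toy tables with the same two-column shape (list.of.tables stands in for my real list):

require(data.table)
list.of.tables <- list(
    data.table(hash=c("aaa", "bbb"), size=c(10L, 20L)),
    data.table(hash=c("bbb", "ccc"), size=c(20L, 30L))
)
combined <- unique(do.call(rbind, list.of.tables))  # 3 unique rows remain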
Here's how I benchmarked those options:
require(data.table)

# Build a vector of random 16-character hashes over [0-9a-z].
makeHash <- function(numberOfHashes) {
    hashspace <- c(0:9, sapply(97:122, function(x) rawToChar(as.raw(x))))
    replicate(numberOfHashes, paste(sample(hashspace, 16), collapse=""))
}

# Two tables sharing half their rows, combined without keys.
mergeNoKey <- function(tableLength, modCount=tableLength/2) {
    A <- B <- data.table(hash=makeHash(tableLength), size=sample(1:(1024^2), tableLength))
    # Replace the first half of A so A and B overlap only in their second halves.
    A[1:modCount] <- data.table(hash=makeHash(modCount), size=sample(1:(1024^2), modCount))
    C <- unique(rbind(A, B))
}

# Same setup, but key both tables on hash before combining.
mergeWithKey <- function(tableLength, modCount=tableLength/2) {
    A <- B <- data.table(hash=makeHash(tableLength), size=sample(1:(1024^2), tableLength))
    A[1:modCount] <- data.table(hash=makeHash(modCount), size=sample(1:(1024^2), modCount))
    setkey(A, hash)
    setkey(B, hash)
    C <- unique(rbind(A, B))
}

require(microbenchmark)
m <- microbenchmark(mergeNoKey(1000), mergeWithKey(1000), times=10)
plot(m)
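(If you'd rather read numbers than a plot, print(m) or summary(m) gives min/median/max timings per expression.)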
I've played around with tableLength and times and seen no big difference in performance between the keyed and un-keyed versions. I feel like there HAS to be a more data.table-ish way to do this.
In practice I need to do this with many data.tables, not two, so scalability is very important; I just wanted to keep the above code simple.
Thanks in advance!
I think you want to use rbindlist and unique.data.table:
C <- unique( rbindlist( list( A , B ) ) )
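Since rbindlist takes a whole list of tables, this extends directly to your many-tables case (list.of.tables standing in for your real list):

C <- unique( rbindlist( list.of.tables ) )

And if you want to compare it yourself, a sketch of a third case you could slot into your benchmark alongside the other two (same setup, rbindlist in place of rbind):

mergeRbindlist <- function(tableLength, modCount=tableLength/2) {
    A <- B <- data.table(hash=makeHash(tableLength), size=sample(1:(1024^2), tableLength))
    A[1:modCount] <- data.table(hash=makeHash(modCount), size=sample(1:(1024^2), modCount))
    C <- unique(rbindlist(list(A, B)))
}
m <- microbenchmark(mergeNoKey(1000), mergeWithKey(1000), mergeRbindlist(1000), times=10)

rbindlist is implemented in C, and the data.table documentation describes it as much faster than do.call(rbind, ...) on a list of tables, which is where the speed-up comes from.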