Replacement for unique(rbind()) when using data.tables

Tags: r, data.table

I have many data.tables that I want to combine into a single data.table with no duplicate rows. The 'naive' way to do this is to wrap an rbind call in unique: unique(do.call(rbind, list.of.tables))
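For concreteness, here is a minimal sketch of that naive approach on toy data. The table contents are made up for illustration; only the name list.of.tables comes from the call above.

```r
library(data.table)

# Hypothetical toy tables; the second and third repeat rows from the first.
list.of.tables <- list(
  data.table(hash = c("aaa", "bbb"), size = c(1L, 2L)),
  data.table(hash = c("bbb", "ccc"), size = c(2L, 3L)),
  data.table(hash = c("aaa", "ddd"), size = c(1L, 4L))
)

# Stack all tables with rbind, then drop rows that duplicate an earlier row.
combined <- unique(do.call(rbind, list.of.tables))
print(combined)
```

Six input rows collapse to four, since two of them repeat an earlier (hash, size) pair.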

This certainly works, but it's pretty slow. In my real-world case the tables have two columns: a hash string and a size. At this point in the code they are unkeyed. I have tried keying by hash first, but the time saved in combining is offset by the time spent keying.

Here's how I benchmarked those options:

library(data.table)

# Build `numberOfHashes` random strings, each made of 16 characters
# sampled without replacement from 0-9 and a-z.
makeHash <- function(numberOfHashes) {
  hashspace <- c(0:9, sapply(97:122, function(x) rawToChar(as.raw(x))))
  replicate(numberOfHashes, paste(sample(hashspace, 16), collapse=""))
}

# Unkeyed case: A and B start identical, then the first modCount rows of A
# are overwritten with new rows, so the remaining rows of A duplicate rows in B.
mergeNoKey <- function(tableLength, modCount=tableLength/2) {
  A <- B <- data.table(hash=makeHash(tableLength), size=sample(1:(1024^2), tableLength))
  A[1:modCount] <- data.table(hash=makeHash(modCount), size=sample(1:(1024^2), modCount))
  C <- unique(rbind(A,B))
}

# Keyed case: same data setup, but both tables are keyed by hash before combining.
mergeWithKey <- function(tableLength, modCount=tableLength/2) {
  A <- B <- data.table(hash=makeHash(tableLength), size=sample(1:(1024^2), tableLength))
  A[1:modCount] <- data.table(hash=makeHash(modCount), size=sample(1:(1024^2), modCount))
  setkey(A, hash)
  setkey(B, hash)
  C <- unique(rbind(A,B))
}

# Time both versions, 10 runs each, and plot the timing distributions.
library(microbenchmark)
m <- microbenchmark(mergeNoKey(1000), mergeWithKey(1000), times=10)
plot(m)

I've varied tableLength and times and seen no big difference in performance between the two versions. I feel like there HAS to be a more data.table-ish way to do this.

In practice I need to do this with many data.tables, not two, so scalability is very important; I just wanted to keep the above code simple.

Thanks in advance!

ClaytonJY asked Sep 06 '13 19:09


1 Answer

I think you want to use rbindlist and unique.data.table...

C <- unique( rbindlist( list( A , B ) ) )
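
Because rbindlist takes a list of tables directly, the same pattern should carry over to the many-table case in the question. A minimal sketch, assuming the tables are already collected in a list; the name list.of.tables follows the question, while the toy data and the names fast and naive are made up for illustration:

```r
library(data.table)

# Hypothetical stand-in for the real data: 50 small tables whose rows overlap.
# size is derived from hash so that repeated hashes give fully duplicated rows.
set.seed(42)
list.of.tables <- lapply(1:50, function(i) {
  h <- sample(letters, 10)
  data.table(hash = h, size = match(h, letters))
})

# Bind the whole list in one call, then drop duplicate rows.
fast <- unique(rbindlist(list.of.tables))

# The approach from the question, for comparison.
naive <- unique(do.call(rbind, list.of.tables))

# Both approaches should produce the same table.
stopifnot(isTRUE(all.equal(fast, naive)))
```

This assumes every table has the same columns in the same order, as in the question. How much faster it is will depend on the number and size of the tables, so it is worth timing on the real data.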
Simon O'Hanlon answered Sep 20 '22 21:09