
Efficiently building a large (200 MM line) dataframe

Tags:

r

I am attempting to build a large (~200 million row) dataframe in R. Each entry in the dataframe will consist of approximately 10 digits (e.g. 1234.12345). The code walks through a list, subtracting the item in position [i] from every item after [i], but not from the items before [i] (if the output were written into a matrix, it would be a triangular matrix). The code is simple and works fine on smaller lists, but I am wondering if there is a faster or more efficient way to do this. I assume the first part of the answer will be "don't use a nested for loop," but I am not sure what the alternatives are.

The idea is that this will be an "edge list" for a social network analysis graph. Once I have outlist I will reduce the number of edges based on some criteria (<, >, ==), so the final list (and graph) won't be quite so ponderous (a sketch of that filtering step follows the loop code below).

# Fake data of same approximate dimensions as real data
dlist <- sample(1:20, 20, replace = FALSE)

# Purge any previous output before running the loop
# (guarded so this doesn't warn when outlist doesn't exist yet)
if (exists("outlist")) rm(outlist)
outlist <- data.frame()

for (i in 1:(length(dlist) - 1)) {
    for (j in (i + 1):length(dlist)) {
        outlist <- rbind(outlist, c(dlist[i], dlist[j], dlist[j] - dlist[i]))
    }
}
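
For the filtering step mentioned above, a minimal sketch (the column names and the threshold of 5 are illustrative, not part of the original code):

# label the loop's output columns, then keep only edges above a hypothetical threshold
names(outlist) <- c("v1", "v2", "weight")
edges <- outlist[outlist$weight > 5, ]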
asked by TBP


1 Answer

IIUC, your final dataset will be ~200 million rows by 3 columns, all of type numeric, which takes a total space of roughly:

200e6 (rows) * 3 (cols) * 8 (bytes) / (1024 ^ 3)
# ~ 4.5GB
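
You can sanity-check that estimate in R itself (a sketch; object.size() reports are approximate and exclude temporary copies):

# build a 1e6-row, 3-column numeric data.frame and scale its size up 200x
df <- data.frame(v1 = runif(1e6), v2 = runif(1e6), v3 = runif(1e6))
as.numeric(object.size(df)) * 200 / 1024^3
# ~ 4.5 (GB)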

That's quite a lot of data, so it's essential to avoid copies wherever possible.

Here's a method that uses the data.table package's unexported (internal) vecseq function (written in C, so it's fast and memory efficient) and its assignment-by-reference operator :=, to avoid copies.

fn1 <- function(x) {
    require(data.table) ## 1.9.2
    lx = length(x)
    vx = as.integer(lx * (lx - 1) / 2)  # number of (i, j) pairs with i < j
    # R v3.1.0 doesn't copy on list(.) - so this should be even faster there
    ans = setDT(list(v1 = rep.int(head(x, -1L), (lx-1L):1L),
                     v2 = x[data.table:::vecseq(2:lx, (lx-1L):1L, vx)]))
    ans[, v3 := v2 - v1]  # add the difference column by reference (no copy)
}
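
For reference, calling it on fake data of the question's shape looks like this (a small illustration; the seed is arbitrary):

set.seed(1L)
dlist <- sample(1:20, 20, replace = FALSE)
res <- fn1(dlist)
head(res, 3)  # a data.table with columns v1, v2 and v3 (= v2 - v1)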

Benchmarking:

I'll benchmark against the functions from other answers, on your data dimensions. Note that my benchmark is on R v3.0.2, but fn1() should give even better performance (both speed and memory) on R v3.1.0, because list(.) no longer results in a copy there.

fn2 <- function(x) {
    diffmat <- outer(x, x, "-")                      # full n x n matrix of x[i] - x[j]
    ss <- which(upper.tri(diffmat), arr.ind = TRUE)  # indices (i, j) with i < j
    # outer() gives x[i] - x[j]; negate so v3 matches the question's dlist[j] - dlist[i]
    data.frame(v1 = x[ss[, 1]], v2 = x[ss[, 2]], v3 = -diffmat[ss])
}

fn3 <- function(x) {
    idx <- combn(seq_along(x), 2)  # all (i, j) pairs with i < j, as a 2-row matrix
    out2 <- data.frame(v1 = x[idx[1, ]], v2 = x[idx[2, ]])
    out2$v3 <- out2$v2 - out2$v1
    out2
}
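
Before timing, it's worth verifying the three functions agree (a sketch; fn1()/fn3() and fn2() emit the pairs in different row orders, so sort before comparing):

x <- runif(10)
a1 <- as.data.frame(fn1(x)); a2 <- fn2(x); a3 <- fn3(x)
srt <- function(d) d[order(d$v1, d$v2), ]  # normalize row order
all.equal(srt(a1), srt(a2), check.attributes = FALSE)  # TRUE
all.equal(srt(a1), srt(a3), check.attributes = FALSE)  # TRUE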

set.seed(45L)
x = runif(20e3L)  # 20000 values -> choose(20000, 2) = 199,990,000 (~200 million) pairs
system.time(ans1 <- fn1(x))  ## 18  seconds + ~8GB  (peak) memory usage
system.time(ans2 <- fn2(x))  ## 158 seconds + ~19GB (peak) memory usage
system.time(ans3 <- fn3(x))  ## 809 seconds + ~12GB (peak) memory usage

Note that fn2() requires a lot of memory due to outer(): the 20000 x 20000 diffmat alone takes 20000^2 * 8 bytes ≈ 3.2GB, and upper.tri() allocates a logical matrix of the same dimensions on top of that, so peak memory usage was >= 19GB, and it is still slower than fn1(). fn3() is by far the slowest (due to combn, and an unnecessary copy).
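
If you want to reproduce the peak-memory figures yourself, one option is R's built-in gc() counters (a sketch; this measures R's own allocations, not OS-level usage):

gc(reset = TRUE)   # reset the "max used" counters
invisible(fn1(x))
gc()               # the "max used" column shows peak memory during the call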

answered by Arun