repeated joins on huge datasets

Tags:

r

data.table

Several matrices, each with 2 columns, need to be combined as shown below

matrix1
1,3
1,5
3,6

matrix2
1,4
1,5
3,6
3,7

output
1,3,1
1,4,1
1,5,2
3,6,2
3,7,1

The third column in the output is the count of how many times a pair has been seen across all the matrices. I wrote some code to do this

require(data.table)

set.seed(1000)
data.lst <- lapply(1:200, function(n) { x <- matrix(sample(1:1000,2000,replace=T), ncol=2); x[!duplicated(x),] })

#method 1
pair1.dt <- data.table(i=integer(0), j=integer(0), cnt=integer(0))
for(mat in data.lst) {
    pair1.dt <- rbind(pair1.dt, data.table(i=mat[,1],j=mat[,2],cnt=1))[, .(cnt=sum(cnt)), .(i,j)]
}

#method 2
pair2.dt <- data.table(i=integer(0), j=integer(0), cnt=integer(0))
for(mat in data.lst) {
    pair2.dt <- merge(pair2.dt, data.table(i=mat[,1],j=mat[,2],cnt=1), by=c("i","j"), all=T)[, 
        cnt:=rowSums(.SD,na.rm=T), .SDcols=c("cnt.x","cnt.y")][, c("cnt.x","cnt.y"):=NULL]
}

cat(sprintf("num.rows  =>  pair1: %d,  pair2: %d", pair1.dt[,.N], pair2.dt[,.N]), "\n")

In the real problem, each of the matrices has tens of millions of rows and there may be 30-40% overlap. I am trying to figure out the fastest way to do this. I tried using Matrix::sparseMatrix; while that is much faster, I ran into the error "long vectors not supported yet". I have a couple of different data.table based approaches here. I am looking for suggestions to speed up this code and/or alternative approaches.
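For reference, the toy output above can be reproduced with a quick base-R sketch (aggregate is for illustration only; it would be far too slow at the scales described here):

```r
# Toy matrices from the example above
m1 <- matrix(c(1, 1, 3,  3, 5, 6), ncol = 2)
m2 <- matrix(c(1, 1, 3, 3,  4, 5, 6, 7), ncol = 2)

both <- rbind(m1, m2)
# Count how many times each (i, j) pair occurs across all matrices
out <- aggregate(list(cnt = rep(1L, nrow(both))),
                 by = list(i = both[, 1], j = both[, 2]), FUN = sum)
out[order(out$i, out$j), ]
# matches the output shown above: (1,3,1), (1,4,1), (1,5,2), (3,6,2), (3,7,1)
```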

asked Apr 18 '16 by ironv

3 Answers

First, make data.tables of them:

dt.lst = lapply(data.lst, as.data.table)

Stacking. For comparison, here's the fast way that involves stacking:

res0 = rbindlist(dt.lst)[, .(n = .N), by=V1:V2]

The OP has said this is not feasible, since the intermediate result made by rbindlist would be too large.

Enumerating first. With a small range of values, I'd suggest enumerating them all up front:

res1 = CJ(V1 = 1:1000, V2 = 1:1000)[, n := 0L]
for (k in seq_along(dt.lst)) res1[ dt.lst[[k]], n := n + .N, by=.EACHI ] 

fsetequal(res0, res1[n>0]) # TRUE

The OP has indicated that there are 1e12 possible values, so this doesn't seem like a good idea. Instead, we can use

res2 = dt.lst[[1L]][0L]
for (k in seq_along(dt.lst)) res2 = funion(res2, dt.lst[[k]])
res2[, n := 0L]
setkey(res2, V1, V2)
for (k in seq_along(dt.lst)) res2[ dt.lst[[k]], n := n + .N, by=.EACHI ]     

fsetequal(res0, res2) # TRUE

This is the slowest of the three options for the example given, but seems best to me in light of the OP's concerns.

Growing inside a loop. Finally...

res3 = dt.lst[[1L]][0L][, n := NA_integer_][]
for (k in seq_along(dt.lst)){
  setkey(res3, V1, V2)
  res3[dt.lst[[k]], n := n + .N, by=.EACHI ]
  res3 = rbind(
    res3, 
    fsetdiff(dt.lst[[k]], res3[, !"n", with=FALSE], all=TRUE)[, .(n = .N), by=V1:V2]
  )
} 

fsetequal(res0, res3) # TRUE

Growing objects inside a loop is strongly discouraged and inefficient in R, but this allows you to do it in a single loop instead of two.

Other options and notes. I suspect you'd be best off using a hash. Hashes are available in the hash package and can probably also be built via the Rcpp package.
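A rough sketch of the hash idea, using a plain R environment as the hash table instead of the hash package (the string keys and the function name are illustrative only, not tuned for speed):

```r
# Count pairs using an environment as a hash table.
# Keys are "i,j" strings; get0 returns 0L when a key is unseen.
count_pairs <- function(mat.lst) {
  h <- new.env(hash = TRUE)
  for (mat in mat.lst) {
    for (key in paste(mat[, 1], mat[, 2], sep = ",")) {
      assign(key, get0(key, envir = h, ifnotfound = 0L) + 1L, envir = h)
    }
  }
  h
}

h <- count_pairs(list(matrix(c(1, 1, 3,  3, 5, 6), ncol = 2),
                      matrix(c(1, 1, 3, 3,  4, 5, 6, 7), ncol = 2)))
get("1,5", envir = h)  # 2: the pair (1,5) appears in both matrices
```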

fsetequal, fsetdiff and funion are recent additions to the development version of the package. Find details on the data.table project's official site.

By the way, if entries within each matrix are distinct, you can replace .N with 1L everywhere above and drop by=.EACHI and all=TRUE.
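On the toy data, that simplification looks like this (a sketch; unique(rbindlist(...)) stands in for the funion loop above):

```r
library(data.table)

# Toy data; rows within each matrix are already distinct
dt.lst <- lapply(list(matrix(c(1, 1, 3,  3, 5, 6), ncol = 2),
                      matrix(c(1, 1, 3, 3,  4, 5, 6, 7), ncol = 2)),
                 as.data.table)

res2 <- unique(rbindlist(dt.lst))  # all distinct pairs
res2[, n := 0L]
setkey(res2, V1, V2)
# each matrix contributes at most 1 per pair, so no by=.EACHI needed
for (k in seq_along(dt.lst)) res2[ dt.lst[[k]], n := n + 1L ]
```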

answered by Frank


Using Rcpp. This method takes advantage of the constant-time lookups of std::unordered_map.

#include <Rcpp.h>
#include <stdint.h>
#include <unordered_map>

using namespace std;
using namespace Rcpp;

//[[Rcpp::export]]
Rcpp::XPtr<int> CreateMap(){
  std::unordered_map<int64_t, int>* myMap = new std::unordered_map<int64_t, int>();
  Rcpp::XPtr<int> p((int*)myMap,false);
  return p;
}

//[[Rcpp::export]]
void FreeMap(Rcpp::XPtr<int> map_ptr){
  std::unordered_map<int64_t, int>* myMap =  (std::unordered_map<int64_t, int>*)(int*)map_ptr;
  delete myMap;
}

//[[Rcpp::export]]
void AccumulateValues(Rcpp::XPtr<int> map_ptr, SEXP mat){

  NumericMatrix m(mat);

  std::unordered_map<int64_t, int>* myMap =  (std::unordered_map<int64_t, int>*)(int*)map_ptr;
  for(int i = 0; i<m.nrow(); i++){
    int c1 = m(i, 0);
    int c2 = m(i, 1);
    int64_t key = ((int64_t)c1 << 32) + c2;
    (*myMap)[key] ++;

  }
}
//[[Rcpp::export]]
SEXP AsMatrix(Rcpp::XPtr<int> map_ptr){
  std::unordered_map<int64_t, int>* myMap =  (std::unordered_map<int64_t, int>*)(int*)map_ptr;
  NumericMatrix m(myMap->size(),3);
  int index = 0;
  for ( auto it = myMap->begin(); it != myMap->end(); ++it ){
    int64_t key = it->first;
    m(index, 0) = (int)(key >> 32);
    m(index, 1) = (int)key;
    m(index, 2) = it->second;
    index++;
  }
  return m;
}

The R code is then:

myMap<-CreateMap()
AccumulateValues(myMap, matrix1)
AccumulateValues(myMap, matrix2)
result<-AsMatrix(myMap)
FreeMap(myMap)

Compiling this also requires

PKG_CXXFLAGS = -std=c++0x

in the package Makevars (or // [[Rcpp::plugins(cpp11)]] when using sourceCpp).

answered by Scott Morken


Perhaps you can process your data in batches, as your memory allows:

maxRows = 5000 # approximately how many rows you can hold in memory
tmp.lst = list()
nrows = 0
idx = 1
for (i in seq_along(data.lst)) {
  tmp.lst[[idx]] = as.data.table(data.lst[[i]])[, cnt := 1]
  idx = idx + 1
  nrows = nrows + nrow(data.lst[[i]])

  # if too many rows, collapse (can also replace by some memory condition)
  if (nrows > maxRows) {
    tmp.lst = list(rbindlist(tmp.lst)[, .(cnt = sum(cnt)), by = V1:V2])
    idx = 2
    nrows = nrow(tmp.lst[[1]])
  }
}

#final collapse
res = rbindlist(tmp.lst)[, .(cnt = sum(cnt)), by = V1:V2]
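As a self-contained sanity check of this batching scheme (smaller data and maxRows than the question's, and cnt := 1L so the column types match for fsetequal), the batched result should equal the direct stacking result:

```r
library(data.table)

set.seed(1000)
data.lst <- lapply(1:20, function(n)
  { x <- matrix(sample(1:100, 50, replace = TRUE), ncol = 2); x[!duplicated(x), ] })

# batched collapse, as above, with a small maxRows to force several collapses
maxRows = 30; tmp.lst = list(); nrows = 0; idx = 1
for (i in seq_along(data.lst)) {
  tmp.lst[[idx]] = as.data.table(data.lst[[i]])[, cnt := 1L]
  idx = idx + 1
  nrows = nrows + nrow(data.lst[[i]])
  if (nrows > maxRows) {
    tmp.lst = list(rbindlist(tmp.lst)[, .(cnt = sum(cnt)), by = V1:V2])
    idx = 2
    nrows = nrow(tmp.lst[[1]])
  }
}
res = rbindlist(tmp.lst)[, .(cnt = sum(cnt)), by = V1:V2]

# direct stacking result for comparison
res0 = rbindlist(lapply(data.lst, as.data.table))[, .(cnt = .N), by = V1:V2]
fsetequal(res0, res)  # TRUE
```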
answered by eddi