Several matrices, each with 2 columns, need to be combined as shown below:
matrix1
1,3
1,5
3,6
matrix2
1,4
1,5
3,6
3,7
output
1,3,1
1,4,1
1,5,2
3,6,2
3,7,1
The third column in the output is the count of how many times a pair has been seen across all the matrices. I wrote some code to do this:
require(data.table)
set.seed(1000)
data.lst <- lapply(1:200, function(n) { x <- matrix(sample(1:1000,2000,replace=T), ncol=2); x[!duplicated(x),] })
#method 1
pair1.dt <- data.table(i=integer(0), j=integer(0), cnt=integer(0))
for(mat in data.lst) {
pair1.dt <- rbind(pair1.dt, data.table(i=mat[,1],j=mat[,2],cnt=1))[, .(cnt=sum(cnt)), .(i,j)]
}
#method 2
pair2.dt <- data.table(i=integer(0), j=integer(0), cnt=integer(0))
for(mat in data.lst) {
pair2.dt <- merge(pair2.dt, data.table(i=mat[,1],j=mat[,2],cnt=1), by=c("i","j"), all=T)[,
cnt:=rowSums(.SD,na.rm=T), .SDcols=c("cnt.x","cnt.y")][, c("cnt.x","cnt.y"):=NULL]
}
cat(sprintf("num.rows => pair1: %d, pair2: %d", pair1.dt[,.N], pair2.dt[,.N]), "\n")
In the real problem, each of the matrices has tens of millions of rows, and there may be 30-40% overlap between them. I am trying to figure out the fastest way to do this. I tried using Matrix::sparseMatrix; while that is much faster, I ran into the error "long vectors not supported yet". I have a couple of different data.table based approaches here. I am looking for suggestions to speed up this code and/or alternative approaches.
First, make data.tables of them:
dt.lst = lapply(data.lst, as.data.table)
Stacking. For comparison, here's the fast way that involves stacking:
res0 = rbindlist(dt.lst)[, .(n = .N), by=V1:V2]
The OP has said this is not feasible, since the intermediate result made by rbindlist would be too large.
Enumerating first. With a small range of values, I'd suggest enumerating them all up front:
res1 = CJ(V1 = 1:1000, V2 = 1:1000)[, n := 0L]
for (k in seq_along(dt.lst)) res1[ dt.lst[[k]], n := n + .N, by=.EACHI ]
fsetequal(res0, res1[n>0]) # TRUE
The OP has indicated that there are 1e12 possible values, so this doesn't seem like a good idea. Instead, we can use
res2 = dt.lst[[1L]][0L]
for (k in seq_along(dt.lst)) res2 = funion(res2, dt.lst[[k]])
res2[, n := 0L]
setkey(res2, V1, V2)
for (k in seq_along(dt.lst)) res2[ dt.lst[[k]], n := n + .N, by=.EACHI ]
fsetequal(res0, res2) # TRUE
This is the slowest of the three options for the example given, but seems best to me in light of the OP's concerns.
Growing inside a loop. Finally...
res3 = dt.lst[[1L]][0L][, n := NA_integer_][]
for (k in seq_along(dt.lst)){
setkey(res3, V1, V2)
res3[dt.lst[[k]], n := n + .N, by=.EACHI ]
res3 = rbind(
res3,
fsetdiff(dt.lst[[k]], res3[, !"n", with=FALSE], all=TRUE)[, .(n = .N), by=V1:V2]
)
}
fsetequal(res0, res3) # TRUE
Growing objects inside a loop is strongly discouraged and inefficient in R, but this allows you to do it in a single loop instead of two.
Other options and notes. I suspect that you'd be best off using a hash. Hashes are available in the hash package and probably also through the Rcpp package.
fsetequal, fsetdiff and funion are recent additions to the development version of the package; find details on the data.table project's official site.
By the way, if entries within each matrix are distinct, you can replace .N with 1L everywhere above and drop by=.EACHI and all=TRUE.
Using Rcpp. This method will take advantage of the hashing property of std::unordered_map.
#include <Rcpp.h>
#include <cstdint>
#include <unordered_map>
using namespace Rcpp;

typedef std::unordered_map<int64_t, int> PairMap;

//[[Rcpp::export]]
Rcpp::XPtr<int> CreateMap(){
    PairMap* myMap = new PairMap();
    // false: no R finalizer, so the map must be released explicitly via FreeMap()
    Rcpp::XPtr<int> p((int*)myMap, false);
    return p;
}

//[[Rcpp::export]]
void FreeMap(Rcpp::XPtr<int> map_ptr){
    PairMap* myMap = (PairMap*)(int*)map_ptr;
    delete myMap;
}

//[[Rcpp::export]]
void AccumulateValues(Rcpp::XPtr<int> map_ptr, SEXP mat){
    NumericMatrix m(mat);
    PairMap* myMap = (PairMap*)(int*)map_ptr;
    for(int i = 0; i < m.nrow(); i++){
        int c1 = m(i, 0);
        int c2 = m(i, 1);
        // pack the pair into a single 64-bit key (assumes non-negative values)
        int64_t key = ((int64_t)c1 << 32) + c2;
        (*myMap)[key]++;
    }
}

//[[Rcpp::export]]
SEXP AsMatrix(Rcpp::XPtr<int> map_ptr){
    PairMap* myMap = (PairMap*)(int*)map_ptr;
    NumericMatrix m(myMap->size(), 3);
    int index = 0;
    for(auto it = myMap->begin(); it != myMap->end(); ++it){
        int64_t key = it->first;
        m(index, 0) = (int)(key >> 32); // first column: high 32 bits of the key
        m(index, 1) = (int)key;         // second column: low 32 bits of the key
        m(index, 2) = it->second;       // third column: the count
        index++;
    }
    return m;
}
The R code is then:
myMap <- CreateMap()
AccumulateValues(myMap, matrix1)
AccumulateValues(myMap, matrix2)
result <- AsMatrix(myMap)
FreeMap(myMap)
This also requires
PKG_CXXFLAGS = "-std=c++0x"
in the package Makevars.
Perhaps you can process your data in batches, as your memory allows:
maxRows = 5000 # approximately how many rows can you hold in memory
tmp.lst = list()
nrows = 0
idx = 1
for (i in seq_along(data.lst)) {
tmp.lst[[idx]] = as.data.table(data.lst[[i]])[, cnt := 1]
idx = idx + 1
nrows = nrows + nrow(data.lst[[i]])
# if too many rows, collapse (can also replace by some memory condition)
if (nrows > maxRows) {
tmp.lst = list(rbindlist(tmp.lst)[, .(cnt = sum(cnt)), by = V1:V2])
idx = 2
nrows = nrow(tmp.lst[[1]])
}
}
#final collapse
res = rbindlist(tmp.lst)[, .(cnt = sum(cnt)), by = V1:V2]