I need to compute a similarity measure call the Dice coefficient over large matrices (600,000 x 500) of binary vectors in R. For speed I use C / Rcpp. The function runs great but as I am not a computer scientist by background I would like to know if it could run faster. This code is suitable for parallelisation but I have no experience parallelising C code.
The Dice coefficient is a simple measure of similarity / dissimilarity (depending how you take it). It is intended to compare asymmetric binary vectors, meaning one of the combination (usually 0-0) is not important and agreement (1-1 pairs) have more weight than disagreement (1-0 or 0-1 pairs). Imagine the following contingency table:
1 0
1 a b
0 c d
The Dice coef is: (2*a) / (2*a +b + c)
Here is my Rcpp implementation:
library(Rcpp)
cppFunction('
NumericMatrix dice(NumericMatrix binaryMat){
int nrows = binaryMat.nrow(), ncols = binaryMat.ncol();
NumericMatrix results(ncols, ncols);
for(int i=0; i < ncols-1; i++){ // columns fixed
for(int j=i+1; j < ncols; j++){ // columns moving
double a = 0;
double d = 0;
for (int l = 0; l < nrows; l++) {
if(binaryMat(l, i)>0){
if(binaryMat(l, j)>0){
a++;
}
}else{
if(binaryMat(l, j)<1){
d++;
}
}
}
// compute Dice coefficient
double abc = nrows - d;
double bc = abc - a;
results(j,i) = (2*a) / (2*a + bc);
}
}
return wrap(results);
}
')
And here is a running example:
x <- rbinom(1:200000, 1, 0.5)
X <- matrix(x, nrow = 200, ncol = 1000)
system.time(dice(X))
user system elapsed
0.814 0.000 0.814
The solution proposed by Roland was not entirely satisfying for my use case. So based on the source code from the arules
package I implement a much faster version. The code in arules
rely on an algorithm from Leisch (2005) using the tcrossproduct()
function in R.
First, I wrote a Rcpp / RcppEigen version of crossprod
that is 2-3 time faster. This is based on the example code in the RcppEigen vignette.
library(Rcpp)
library(RcppEigen)
library(inline)
crossprodCpp <- '
using Eigen::Map;
using Eigen::MatrixXi;
using Eigen::Lower;
const Map<MatrixXi> A(as<Map<MatrixXi> >(AA));
const int m(A.rows()), n(A.cols());
MatrixXi AtA(MatrixXi(n, n).setZero().selfadjointView<Lower>().rankUpdate(A.adjoint()));
return wrap(AtA);
'
fcprd <- cxxfunction(signature(AA = "matrix"), crossprodCpp, "RcppEigen")
Then I wrote a small R function to compute the Dice coefficient.
diceR <- function(X){
a <- fcprd(X)
nx <- ncol(X)
rsx <- colSums(X)
c <- matrix(rsx, nrow = nx, ncol = nx) - a
# b <- matrix(rsx, nrow = nx, ncol = nx, byrow = TRUE) - a
b <- t(c)
m <- (2 * a) / (2*a + b + c)
return(m)
}
This new function is ~8 time faster than the old one and ~3 time faster than the one in arules
.
m <- microbenchmark(dice(X), diceR(X), dissimilarity(t(X), method="dice"), times=100)
m
# Unit: milliseconds
# expr min lq median uq max neval
# dice(X) 791.34558 809.8396 812.19480 814.6735 910.1635 100
# diceR(X) 62.98642 76.5510 92.02528 159.2557 507.1662 100
# dissimilarity(t(X), method = "dice") 264.07997 342.0484 352.59870 357.4632 520.0492 100
I cannot run your function at work, but is the result the same as this?
library(arules)
plot(dissimilarity(X,method="dice"))
system.time(dissimilarity(X,method="dice"))
#user system elapsed
#0.04 0.00 0.04
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With