Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Interleaving results from many objects in Rcpp

Tags:

r

rcpp

I need to write to a file row by row of matrices and sparse matrices that appears in a list and I am doing something like this:

#include <RcppArmadillo.h>

// [[Rcpp::export]]
bool write_rows (Rcpp::List data, Rcpp::CharacterVector clss, int n) {

  int len = data.length();
  for(int i = 0; i<n; i++) {
    for(int j=0; j<len; j++) {
      if (clss[j] == "matrix") {
        Rcpp::NumericMatrix x = data[j];
        auto row = x.row(i);
        // do something with row i
      } else if (clss[j] == "dgCMatrix") {
        arma::sp_mat x = data[j];
        auto row = x.row(i);
        // do something different with row i
      }
    }
  }

  return true;
}

This function can be called in R with:

data <- list(
  x = Matrix::rsparsematrix(nrow = 1000, ncol = 1000, density = 0.3), 
  y = matrix(1:10000, nrow = 1000, ncol = 10)
  )

clss <- c("dgCMatrix", "matrix")

write_rows(data, clss, 1000)

The function receives a list of matrices or sparse matrices with the same number of rows and writes those matrices row by row, ie. first writes first rows of all elements in data then the second row of all elements and etc.

My problem is that it seems that this line arma::sp_mat x = data[i]; seems to have a huge impact in performance since it seems that I am implicitly casting the list element data[j] to an Armadillo Sparse Matrix n times.

My question is: is there anyway I could avoid this? Is there a more efficient solution? I tried to find a solution by looking into readr's source code, since they also write list elements row by row, but they also do a cast for each row (in this line for example, but maybe this doesn't impact the performance because they deal with SEXPS?

like image 222
Daniel Falbel Avatar asked Apr 26 '18 03:04

Daniel Falbel


1 Answers

With the clarification, it seems that the result should interleave the rows from each matrix. You can still do this while avoiding multiple conversions.

This is the original code, modified to generate some actual output:

// [[Rcpp::export]]
arma::mat write_rows(Rcpp::List data, Rcpp::CharacterVector clss, int nrows, int ncols) {

    int len = data.length();
    arma::mat result(nrows*len, ncols);

    for (int i = 0, k = 0; i < nrows; i++) {
        for (int j = 0; j < len; j++) {
            arma::rowvec r;

            if (clss[j] == "matrix") {
                Rcpp::NumericMatrix x = data[j];
                r = x.row(i);
            }
            else {
                arma::sp_mat x = data[j];
                r = x.row(i);
            }

            result.row(k++) = r;
        }
    }

    return result;
}

The following code creates a vector of converted objects, and then extracts the rows from each object as required. The conversion is only done once per matrix. I use a struct containing a dense and sparse mat because it's a lot simpler than dealing with unions; and I don't want to drag in boost::variant or require C++17. Since there's only 2 classes we want to deal with, the overhead is minimal.

struct Matrix_types {
    arma::mat m;
    arma::sp_mat M;
};

// [[Rcpp::export]]
arma::mat write_rows2(Rcpp::List data, Rcpp::CharacterVector clss, int nrows, int ncols) {

    const int len = data.length();
    std::vector<Matrix_types> matr(len);
    std::vector<bool> is_dense(len);
    arma::mat result(nrows*len, ncols);

    // populate the structs
    for (int j = 0; j < len; j++) {
        is_dense[j] = (clss[j] == "matrix");
        if (is_dense[j]) {
            matr[j].m = Rcpp::as<arma::mat>(data[j]);
        }
        else {
            matr[j].M = Rcpp::as<arma::sp_mat>(data[j]);
        }
    }

    // populate the result
    for (int i = 0, k = 0; i < nrows; i++) {
        for (int j = 0; j < len; j++, k++) {
            if (is_dense[j]) {
                result.row(k) = matr[j].m.row(i);
            }
            else {
                arma::rowvec r(matr[j].M.row(i));
                result.row(k) = r;
            }
        }
    }

    return result;
}

Running on some test data:

data <- list(
    a=Matrix(1.0, 1000, 1000, sparse=TRUE),
    b=matrix(2.0, 1000, 1000),
    c=Matrix(3.0, 1000, 1000, sparse=TRUE),
    d=matrix(4.0, 1000, 1000)
)

system.time(z <- write_rows(data, sapply(data, class), 1000, 1000))
#   user  system elapsed 
# 185.75   35.04  221.38 

system.time(z2 <- write_rows2(data, sapply(data, class), 1000, 1000))
#   user  system elapsed 
#   4.21    0.05    4.25 

identical(z, z2)
# [1] TRUE
like image 130
Hong Ooi Avatar answered Sep 27 '22 23:09

Hong Ooi