Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Efficiency of matrix rowSums() vs. colSums() in R vs Rcpp vs Armadillo

Background

Coming from R programming, I'm in the process of expanding to compiled code in the form of C/C++ with Rcpp. As a hands on exercise on the effect of loop interchange (and just C/C++ in general), I implemented equivalents to R's rowSums() and colSums() functions for matrices with Rcpp (I know these exist as Rcpp sugar and in Armadillo -- this was just an exercise).

Question

I have my C++ implementation of rowSums() and colSums() along with Rcpp sugar and arma::sum() versions in this matsums.cpp file. Mine are just simple loops like this:

NumericVector Cpp_colSums(const NumericMatrix& x) {
  int nr = x.nrow(), nc = x.ncol();
  NumericVector ans(nc);
  for (int j = 0; j < nc; j++) {
    double sum = 0.0;
    for (int i = 0; i < nr; i++) {
      sum += x(i, j);
    }
    ans[j] = sum;
  }
  return ans;
}

NumericVector Cpp_rowSums(const NumericMatrix& x) {
  int nr = x.nrow(), nc = x.ncol();
  NumericVector ans(nr);
  for (int j = 0; j < nc; j++) {
    for (int i = 0; i < nr; i++) {
      ans[i] += x(i, j);
    }
  }
  return ans;
}

(R matrices are stored column-major, so columns in the outer loop should be the more efficient approach. That's what I was testing originally.)

While running benchmarks on these, I ran into something I wasn't expecting: there was a clear performance difference between row sums and col sums (see benchmarks below):

  1. Using the builtin R functions, colSums() is about twice as fast as rowSums().
  2. With my own Rcpp and the sugar version, this is reversed: it is rowSums() that is about twice as fast as colSums().
  3. And finally, adding the Armadillo implementations, the operations are roughly equal (col sum maybe a bit faster, as I would have expected them to be in R, too).

So my primary question is: why is Cpp_rowSums() significantly faster than Cpp_colSums()?

As a secondary interest, I'm also curious why the same difference is reversed in the R implementations. (I skimmed through the C source, but could not really make out the significant differences.) (And third, how come Armadillo gets equal performance?)

Benchmarks

I tested all 4 implementations of both functions on a 10,000 x 10,000 symmetric matrix:

Rcpp::sourceCpp("matsums.cpp")

set.seed(92136)
n <- 1e4 # build n x n test matrix
x <- matrix(rnorm(n), 1, n)
x <- crossprod(x, x) # symmetric

bench::mark(
  rowSums(x),
  colSums(x),
  Cpp_rowSums(x),
  Cpp_colSums(x),
  Sugar_rowSums(x),
  Sugar_colSums(x),
  Arma_rowSums(x),
  Arma_colSums(x)
)[, 1:7]

#> # A tibble: 8 x 7
#>   expression            min     mean   median      max `itr/sec` mem_alloc
#>   <chr>            <bch:tm> <bch:tm> <bch:tm> <bch:tm>     <dbl> <bch:byt>
#> 1 rowSums(x)        192.2ms  207.9ms  194.6ms  236.9ms      4.81    78.2KB
#> 2 colSums(x)         93.4ms   97.2ms   96.5ms  101.3ms     10.3     78.2KB
#> 3 Cpp_rowSums(x)     73.5ms   76.3ms     76ms   80.4ms     13.1     80.7KB
#> 4 Cpp_colSums(x)    126.5ms  127.6ms  126.8ms  130.3ms      7.84    80.7KB
#> 5 Sugar_rowSums(x)   73.9ms   75.6ms   74.3ms   79.4ms     13.2     80.7KB
#> 6 Sugar_colSums(x)  124.2ms  125.8ms  125.6ms  127.9ms      7.95    80.7KB
#> 7 Arma_rowSums(x)    73.2ms   74.7ms   73.9ms   79.3ms     13.4     80.7KB
#> 8 Arma_colSums(x)    62.8ms   64.4ms   63.7ms   69.6ms     15.5     80.7KB

(Again, you can find the C++ source file matsums.cpp here.)

Platform:

> sessioninfo::platform_info()
 setting  value                       
 version  R version 3.5.1 (2018-07-02)
 os       Windows >= 8 x64            
 system   x86_64, mingw32             
 ui       RStudio                     
 language (EN)                        
 collate  English_United States.1252  
 tz       Europe/Helsinki             
 date     2018-08-09  

Update

Investigating further, I also wrote the same functions using R's traditional C interface: the source is here. I compiled the functions with R CMD SHLIB, and tested again: the same phenomenon of row sums being faster than col sums persisted (benchmarks). I then also looked at the disassembly with objdump, but it seems to me (with my very limited understanding of asm) that the compiler doesn't really optimize the main loop bodies (rows, cols) any further from the C code, either?

like image 481
Mikko Marttila Avatar asked Aug 09 '18 19:08

Mikko Marttila


1 Answers

First, let me show the timing statistics on my laptop. I use a 5000 x 5000 matrix which is sufficient for benchmarking, and microbenchmark package is used for 100 evaluations.

Unit: milliseconds
             expr       min        lq      mean    median        uq       max
       colSums(x)  71.40671  71.64510  71.80394  71.72543  71.80773  75.07696
   Cpp_colSums(x)  71.29413  71.42409  71.65525  71.48933  71.56241  77.53056
 Sugar_colSums(x)  73.05281  73.19658  73.38979  73.25619  73.31406  76.93369
  Arma_colSums(x)  39.08791  39.34789  39.57979  39.43080  39.60657  41.70158
       rowSums(x) 177.33477 187.37805 187.57976 187.49469 187.73155 194.32120
   Cpp_rowSums(x)  54.00498  54.37984  54.70358  54.49165  54.73224  64.16104
 Sugar_rowSums(x)  54.17001  54.38420  54.73654  54.56275  54.75695  61.80466
  Arma_rowSums(x)  49.54407  49.77677  50.13739  49.90375  50.06791  58.29755

C code in R core is not always better than what we can write ourselves. That Cpp_rowSums is faster than rowSums shows this. I don't feel myself competent to explain why R's version is slower than it should be. I will focuse on: how we can further optimize our own colSums and rowSums to beat Armadillo. Note that I write C, use R's old C interface and do compilation with R CMD SHLIB.


Is there any substantial difference between colSums and rowSums?

If we have an n x n matrix that is much larger than the capacity of a CPU cache, colSums loads n x n data from RAM to cache, but rowSums loads as twice as many, i.e., 2 x n x n.

Think about the resulting vector that holds the sum: how many times this length-n vector is loaded into cache from RAM? For colSums, it is loaded only once, but for rowSums, it is loaded n times. Each time you add a matrix column to it, it is loaded into cache but then evicted since it is too big.

For a large n:

  • colSums causes n x n + n data load from RAM to cache;
  • rowSums causes n x n + n x n data load from RAM to cache.

In other words, rowSums is in theory less memory efficient, and is likely to be slower.


How to improve the performance of colSums?

Since the data flow between RAM and cache is readily optimal, the only improvement is loop unrolling. Unrolling the inner loop (the summation loop) by a depth of 2 is sufficient and we will see a 2x boost.

Loop unrolling works as it enables CPU's instruction pipeline. If we just do one addition per iteration, no pipelining is possible; with two additions this instruction-level parallelism starts to work. We can also unroll the loop by a depth of 4, but my experience is that a depth-2 unrolling is sufficient to gain most of the benefit from loop unrolling.

How to improve the performance of rowSums?

Optimization of data flow is the first step. We need to first do cache blocking to reduce the data transfer from 2 x n x n down to n x n.

Chop this n x n matrix into a number of row chunks: each being 2040 x n (the last chunk may be smaller), then apply the ordinary rowSums chunk by chunk. For each chunk, the accumulator vector has length-2040, about half of what a 32KB CPU cache can hold. The other half is reversed for a matrix column added to this accumulator vector. In this way, the accumulator vector can be hold in the cache until all matrix columns in this chunk are processed. As a result, the accumulator vector is only loaded into cache once, hence the overall memory performance is as good as that for colSums.

Now we can further apply loop unrolling for the rowSums in each chunk. Unroll both the outer loop and inner loop by a depth of 2, we will see a boost. Once the outer loop is unrolled, the chunk size should be reduced to 1360, as now we need space in the cache to hold three length-1360 vectors per outer loop iteration.


C code: Let's beat Armadillo

Writing code with loop unrolling can be a nasty job as we now need to write several different versions for a function.

For colSums, we need two versions:

  • colSums_1x1: both inner and outer loops are unrolled with depth 1, i.e., this is a version without loop unrolling;
  • colSums_2x1: no outer loop unrolling, while inner loop is unrolled with depth 2.

For rowSums we can have up to four versions, rowSums_sxt, where s = 1 or 2 is the unrolling depth for inner loop and t = 1 or 2 is the unrolling depth for outer loop.

Code writing can be very tedious if we write each version one by one. After many years or frustration on this I developed an "automatic code / version generation" trick using inlined template functions and macros.

#include <stdlib.h>
#include <Rinternals.h>

static inline void colSums_template_sx1 (size_t s,
                                         double *A, size_t LDA,
                                         size_t nr, size_t nc,
                                         double *sum) {

  size_t nrc = nr % s, i;
  double *A_end = A + LDA * nc, a0, a1;

  for (; A < A_end; A += LDA) {
    a0 = 0.0; a1 = 0.0;  // accumulator register variables
    if (nrc > 0) a0 = A[0];  // is there a "fractional loop"?
    for (i = nrc; i < nr; i += s) {  // main loop of depth-s
      a0 += A[i];  // 1st iteration
      if (s > 1) a1 += A[i + 1];  // 2nd iteration
      }
    if (s > 1) a0 += a1;  // combine two accumulators
    *sum++ = a0;  // write-back
    }

  }

#define macro_define_colSums(s, colSums_sx1) \
SEXP colSums_sx1 (SEXP matA) { \
  double *A = REAL(matA); \
  size_t nrow_A = (size_t)nrows(matA); \
  size_t ncol_A = (size_t)ncols(matA); \
  SEXP result = PROTECT(allocVector(REALSXP, ncols(matA))); \
  double *sum = REAL(result); \
  colSums_template_sx1(s, A, nrow_A, nrow_A, ncol_A, sum); \
  UNPROTECT(1); \
  return result; \
  }

macro_define_colSums(1, colSums_1x1)
macro_define_colSums(2, colSums_2x1)

The template function computes (in R-syntax) sum <- colSums(A[1:nr, 1:nc]) for a matrix A with LDA (leading dimension of A) rows. The parameter s is a version control on inner loop unrolling. The template function looks horrible at first glance as it contains many if. However, it is declared static inline. If it is called by passing in known constant 1 or 2 to s, an optimizing compiler is able to evaluate those if at compile-time, eliminate unreachable code and drop "set-but-not-used" variables (registers variables that are initialized, modified but not written back to RAM).

The macro is used for function declaration. Accepting a constant s and a function name, it generates a function with desired loop unrolling version.

The following is for rowSums.

static inline void rowSums_template_sxt (size_t s, size_t t,
                                         double *A, size_t LDA,
                                         size_t nr, size_t nc,
                                         double *sum) {

  size_t ncr = nc % t, nrr = nr % s, i;
  double *A_end = A + LDA * nc, *B;
  double a0, a1;

  for (i = 0; i < nr; i++) sum[i] = 0.0;  // necessary initialization

  if (ncr > 0) {  // is there a "fractional loop" for the outer loop?
    if (nrr > 0) sum[0] += A[0];  // is there a "fractional loop" for the inner loop?
    for (i = nrr; i < nr; i += s) {  // main inner loop with depth-s
      sum[i] += A[i];
      if (s > 1) sum[i + 1] += A[i + 1];
      }
    A += LDA;
    }

  for (; A < A_end; A += t * LDA) {  // main outer loop with depth-t
    if (t > 1) B = A + LDA;
    if (nrr > 0) {  // is there a "fractional loop" for the inner loop?
      a0 = A[0]; if (t > 1) a0 += A[LDA];
      sum[0] += a0;
      }
    for(i = nrr; i < nr; i += s) {  // main inner loop with depth-s
      a0 = A[i]; if (t > 1) a0 += B[i];
      sum[i] += a0;
      if (s > 1) {
        a1 = A[i + 1]; if (t > 1) a1 += B[i + 1];
        sum[i + 1] += a1;
        }
      }
    }

  }

#define macro_define_rowSums(s, t, rowSums_sxt) \
SEXP rowSums_sxt (SEXP matA, SEXP chunk_size) { \
  double *A = REAL(matA); \
  size_t nrow_A = (size_t)nrows(matA); \
  size_t ncol_A = (size_t)ncols(matA); \
  SEXP result = PROTECT(allocVector(REALSXP, nrows(matA))); \
  double *sum = REAL(result); \
  size_t block_size = (size_t)asInteger(chunk_size); \
  size_t i, block_size_i; \
  if (block_size > nrow_A) block_size = nrow_A; \
  for (i = 0; i < nrow_A; i += block_size_i) { \
    block_size_i = nrow_A - i; if (block_size_i > block_size) block_size_i = block_size; \
    rowSums_template_sxt(s, t, A, nrow_A, block_size_i, ncol_A, sum); \
    A += block_size_i; sum += block_size_i; \
    } \
  UNPROTECT(1); \
  return result; \
  }

macro_define_rowSums(1, 1, rowSums_1x1)
macro_define_rowSums(1, 2, rowSums_1x2)
macro_define_rowSums(2, 1, rowSums_2x1)
macro_define_rowSums(2, 2, rowSums_2x2)

Note that the template function now accepts s and t, and the function to be defined by the macro has applied row chunking.

Even though I've left some comments along the code, the code is probably still not easy to follow, but I can't take more time to explain in greater details.

To use them, copy and paste them into a C file called "matSums.c" and compile it with R CMD SHLIB -c matSums.c.

For the R side, define the following functions in "matSums.R".

colSums_zheyuan <- function (A, s) {
  dyn.load("matSums.so")
  if (s == 1) result <- .Call("colSums_1x1", A)
  if (s == 2) result <- .Call("colSums_2x1", A)
  dyn.unload("matSums.so")
  result
  }

rowSums_zheyuan <- function (A, chunk.size, s, t) {
  dyn.load("matSums.so")
  if (s == 1 && t == 1) result <- .Call("rowSums_1x1", A, as.integer(chunk.size))
  if (s == 2 && t == 1) result <- .Call("rowSums_2x1", A, as.integer(chunk.size))
  if (s == 1 && t == 2) result <- .Call("rowSums_1x2", A, as.integer(chunk.size))
  if (s == 2 && t == 2) result <- .Call("rowSums_2x2", A, as.integer(chunk.size))
  dyn.unload("matSums.so")
  result
  }

Now let's have a benchmark, again with a 5000 x 5000 matrix.

A <- matrix(0, 5000, 5000)

library(microbenchmark)
source("matSums.R")

microbenchmark("col0" = colSums(A),
               "col1" = colSums_zheyuan(A, 1),
               "col2" = colSums_zheyuan(A, 2),
               "row0" = rowSums(A),
               "row1" = rowSums_zheyuan(A, nrow(A), 1, 1),
               "row2" = rowSums_zheyuan(A, 2040, 1, 1),
               "row3" = rowSums_zheyuan(A, 1360, 1, 2),
               "row4" = rowSums_zheyuan(A, 1360, 2, 2))

On my laptop I get:

Unit: milliseconds
 expr       min        lq      mean    median        uq       max neval
 col0  65.33908  71.67229  71.87273  71.80829  71.89444 111.84177   100
 col1  67.16655  71.84840  72.01871  71.94065  72.05975  77.84291   100
 col2  35.05374  38.98260  39.33618  39.09121  39.17615  53.52847   100
 row0 159.48096 187.44225 185.53748 187.53091 187.67592 202.84827   100
 row1  49.65853  54.78769  54.78313  54.92278  55.08600  60.27789   100
 row2  49.42403  54.56469  55.00518  54.74746  55.06866  60.31065   100
 row3  37.43314  41.57365  41.58784  41.68814  41.81774  47.12690   100
 row4  34.73295  37.20092  38.51019  37.30809  37.44097  99.28327   100

Note how loop unrolling speeds up both colSums and rowSums. And with full optimization ("col2" and "row4"), we beat Armadillo (see the timing table at the beginning of this answer).

The row chunking strategy does not clearly yield benefit in this case. Let's try a matrix with millions of rows.

A <- matrix(0, 1e+7, 20)
microbenchmark("row1" = rowSums_zheyuan(A, nrow(A), 1, 1),
               "row2" = rowSums_zheyuan(A, 2040, 1, 1),
               "row3" = rowSums_zheyuan(A, 1360, 1, 2),
               "row4" = rowSums_zheyuan(A, 1360, 2, 2))

I get

Unit: milliseconds
 expr      min       lq     mean   median       uq      max neval
 row1 604.7202 607.0256 617.1687 607.8580 609.1728 720.1790   100
 row2 514.7488 515.9874 528.9795 516.5193 521.4870 636.0051   100
 row3 412.1884 413.8688 421.0790 414.8640 419.0537 525.7852   100
 row4 377.7918 379.1052 390.4230 379.9344 386.4379 476.9614   100

In this case we observe the gains from cache blocking.


Final thoughts

Basically this answer has addressed all the issues, except for the following:

  • why R's rowSums is less efficient than it should be.
  • why without any optimization, rowSums ("row1") is faster than colSums ("col1").

Again, I cannot explain the first and actually I don't care that since we can easily write a version that is faster than R's built-in version.

The 2nd is definitely worth pursuing. I copy in my comments in our discussion room for a record.

This issue is down to this: "why adding up a single vector is slower than adding two vectors element-wise?"

I see similar phenomenon from time to time. The first time I encountered this strange behavior was when I, a few years ago, coded my own matrix-matrix multiplication. I found that DAXPY is faster than DDOT.

DAXPY does this: y += a * x, where x and y are vectors and a is a scalar; DDOT does this: a += x * y.

Given than DDOT is a reduction operation I expect that it is faster than DAXPY. But no, DAXPY is faster.

However, as soon as I unroll the loop in the triple loop-nest of the matrix-multiplication, DDOT is much faster than DAXPY.

A very similar thing happens to your issue. A reduction operation: a = x[1] + x[2] + ... + x[n] is slower than element-wise add: y[i] += x[i]. But once loop unrolling is done, the advantage of the latter is lost.

I am not sure whether the following explanation is true as I have no evidence.

The reduction operation has a dependency chain so the computation is strictly serial; on the other hand, element-wise operation has no dependency chain, so that CPU may do better with it.

As soon as we unroll the loop, each iteration has more arithmetics to do and CPU can do pipelining in both cases. The true advantage of the reduction operation can then be observed.


In reply to Jaap on using rowSums2 and colSums2 from matrixStats

Still using the 5000 x 5000 example above.

A <- matrix(0, 5000, 5000)

library(microbenchmark)
source("matSums.R")
library(matrixStats)  ## NEW

microbenchmark("col0" = base::colSums(A),
               "col*" = matrixStats::colSums2(A),  ## NEW
               "col1" = colSums_zheyuan(A, 1),
               "col2" = colSums_zheyuan(A, 2),
               "row0" = base::rowSums(A),
               "row*" = matrixStats::rowSums2(A),  ## NEW
               "row1" = rowSums_zheyuan(A, nrow(A), 1, 1),
               "row2" = rowSums_zheyuan(A, 2040, 1, 1),
               "row3" = rowSums_zheyuan(A, 1360, 1, 2),
               "row4" = rowSums_zheyuan(A, 1360, 2, 2))

Unit: milliseconds
 expr       min        lq      mean    median        uq       max neval
 col0  71.53841  71.72628  72.13527  71.81793  71.90575  78.39645   100
 col*  75.60527  75.87255  76.30752  75.98990  76.18090  87.07599   100
 col1  71.67098  71.86180  72.06846  71.93872  72.03739  77.87816   100
 col2  38.88565  39.03980  39.57232  39.08045  39.16790  51.39561   100
 row0 187.44744 187.58121 188.98930 187.67168 187.86314 206.37662   100
 row* 158.08639 158.26528 159.01561 158.34864 158.62187 174.05457   100
 row1  54.62389  54.81724  54.97211  54.92394  55.04690  56.33462   100
 row2  54.15409  54.44208  54.78769  54.59162  54.76073  60.92176   100
 row3  41.43393  41.63886  42.57511  41.73538  41.81844 111.94846   100
 row4  37.07175  37.25258  37.45033  37.34456  37.47387  43.14157   100

I don't see performance advantage of rowSums2 and colSums2.

like image 200
Zheyuan Li Avatar answered Sep 23 '22 08:09

Zheyuan Li