Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Deciding between NumericVector and arma::vec in Rcpp

Tags:

r

rcpp

armadillo

With RcppArmadillo the conversion from R to Rcpp with arma::vec is just as easy as with Rcpp and NumericVector. My project utilizes RcppArmadillo.

I'm unsure what to use, NumericVector or arma::vec? What are the key differences between those two? When to use which? Is there a performance/memory advantage of using one over the other? Are the only difference the member functions? And, as a bonus question: should I even consider arma::colvec or arma::rowvec?

like image 988
Clawish Avatar asked May 04 '18 08:05

Clawish


2 Answers

What are the key differences between those two?

The *Vector and *Matrix classes in Rcpp act as wrappers for R's SEXP representation, e.g. an S expression that is as a pointer to the data. For details, please see Section 1.1 SEXPs of R Internals.Rcpp's design leverages this by constructing C++ objects from classes that enclose the pointer to the data. This promotes two key features:

  1. Seamless transference between R and C++ objects, and
  2. Low transference cost between R and C++ as only a pointer is passed.
    • as the data isn't copied but referenced

Meanwhile, arma objects are akin to a traditional std::vector<T> in the way that a deep copy occurs between the R and C++ objects. There is one exception to this statement, the presence of the advanced constructor, which allows for the memory behind the R object to be reused inside of an armadillo object's structure. Thus, if you are not careful, you may incur an unnecessary penalty during the transition from R to C++ and vice versa.

Note: The advanced constructor that allows the reuse of memory does not exist for arma::sp_mat. Thus, using references with sparse matrices will likely not yield the desired speed up as a copy is performed from R to C++ and back.

You can view the differences based largely on the "pass-by-reference" or "pass-by-copy" paradigm. To understand the difference outside of code, consider the following GIF by mathwarehouse:

To illustrate this scenario in code, consider the following three functions:

#include <RcppArmadillo.h>

// [[Rcpp::depends(RcppArmadillo)]]

// [[Rcpp::export]]
void memory_reference_double_ex(arma::vec& x, double value) {
    x.fill(value);
}

// [[Rcpp::export]]
void memory_reference_int_ex(arma::ivec& x, int value) {
    x.fill(value);
}

// [[Rcpp::export]]
arma::vec memory_copy_ex(arma::vec x, int value) {
    x.fill(value);
    return x;
}

The two functions memory_reference_double_ex() and memory_reference_int_ex() will update the object inside of R assuming the appropriate data type is present. As a result, we are able to avoid returning a value by specifying void in their definitions since the memory allocated by x is being reused. The third function, memory_copy_ex() requires a return type since it passes-by-copy and, thus, does not modify the existing storage without a reassignment call.

To emphasize:

  1. The x vector will be passed into C++ by reference, e.g. & on the end of arma::vec& or arma::ivec&, and
  2. The class of x in R is either double or integer meaning we are matching the underlying type of arma::vec, e.g Col<double>, or arma::ivec, e.g. Col<int>.

Let's quickly take a look at two examples.

Within the first example, we will look at the results from running memory_reference_double_ex() and compare it to the results generated by memory_copy_ex(). Note, the types between the objected defined in R and C++ are the same (e.g. double). In the next example, this will not hold.

x = c(0.1, 2.3, 4.8, 9.1)
typeof(x)
# [1] "double"

x
# [1] 0.1 2.3 4.8 9.1

# Nothing is returned...
memory_reference_double_ex(x, value = 9)

x
# [1] 9 9 9 9

a = memory_copy_ex(x, value = 3)

x
# [1] 9 9 9 9

a
#      [,1]
# [1,]    3
# [2,]    3
# [3,]    3
# [4,]    3

Now, what happens if the underlying type of the R object is an integer instead of a double?

x = c(1L, 2L, 3L, 4L)
typeof(x)
# [1] "integer"

x
# [1] 1 2 3 4

# Return nothing...
memory_reference_double_ex(x, value = 9)

x
# [1] 1 2 3 4

What happened? Why didn't x get updated? Well, behind the scenes Rcpp created a new memory allocation that was the proper type -- double and not int -- before passing it onto armadillo. This caused the reference "linkage" between the two objects to differ.

If we change to using an integer data type in the armadillo vector, notice we now have the same effect given previously:

memory_reference_int_ex(x, value = 3)

x
# [1] 3 3 3 3

This leads to discussion on the usefulness of these two paradigms. As speed is the preferred benchmark when working with C++, let's view this in terms of a benchmark.

Consider the following two functions:

#include <RcppArmadillo.h>

// [[Rcpp::depends(RcppArmadillo)]]

// [[Rcpp::export]]
void copy_double_ex(arma::vec x, double value) {
    x.fill(value);
}

// [[Rcpp::export]]
void reference_double_ex(arma::vec& x, double value) {
    x.fill(value);
}

Running a microbenchmark over them yields:

# install.packages("microbenchmark")
library("microbenchmark")

x = rep(1, 1e8)

micro_timings = microbenchmark(copy_double_ex(x, value = 9.0),
                               reference_double_ex(x, value = 9.0))
autoplot(micro_timings)
micro_timings

# Unit: milliseconds
#                               expr       min        lq      mean    median        uq      max neval
#       copy_double_ex(x, value = 9) 523.55708 529.23219 547.22669 536.71177 555.00069 640.5020   100
#  reference_double_ex(x, value = 9)  79.78624  80.70757  88.67695  82.44711  85.73199 308.4219   100

enter image description here

Note: The referenced object is ~ 6.509771 times faster per iteration than the copied paradigm as we do not have to reallocate and fill that memory.

When to use which?

What do you need to do?

Are you just trying to quickly speed up an algorithm that relies on a loop but lacks the need for rigorous linear algebra manipulation?

If so, just using Rcpp should suffice.

Are you trying to perform linear algebra manipulations? Or are you hoping to use this code across multiple libraries or computational platforms (e.g. MATLAB, Python, R, ...)?

If so, you should be writing the crux of the algorithm in armadillo and setting up the appropriate hooks to export the functions into R with Rcpp.

Is there a performance/memory advantage of using one over the other?

Yes, as indicated previously, there is definitely a performance / memory advantage. Not only that, but by using RcppArmadillo you are effectively adding an additional library ontop of Rcpp and, thus, increasing the overall installation footprint, compile time, and system requirements (see the woes of macOS builds). Figure out what is required by your project and then opt for that structure.

Are the only difference the member functions?

Not only member functions, but:

  • estimation routines in terms of matrix decomposition
  • computing statistical quantity values
  • object generation
  • sparse representation (avoid manipulating an S4 object)

These are fundamental differences between Rcpp and armadillo. One is meant to facilitate transference of R objects into C++ whereas the other is meant for more rigorous linear algebra computations. This should be largely evident as Rcpp does not implement any matrix multiplication logic whereas armadillo uses the system's Basic Linear Algebra Subprograms (BLAS) to perform the computation.

And, as a bonus question: should I even consider arma::colvec or arma::rowvec?

Depends on how you want the result to be returned. Do you want to have an: 1 x N (row vector) or N x 1 (column vector)? RcppArmadillo by default returns these structures as matrix objects with their appropriate dimensions and not a traditional 1D R vector.

As an example:

#include <RcppArmadillo.h>

// [[Rcpp::depends(RcppArmadillo)]]

// [[Rcpp::export]]
arma::vec col_example(int n) {
    arma::vec x = arma::randu<arma::vec>(n);
    return x;
}


// [[Rcpp::export]]
arma::rowvec row_example(int n) {
    arma::rowvec x = arma::randu<arma::rowvec>(n);
    return x;
}

Test:

set.seed(1)
col_example(4)
#           [,1]
# [1,] 0.2655087
# [2,] 0.3721239
# [3,] 0.5728534
# [4,] 0.9082078

set.seed(1)
row_example(4)
#           [,1]      [,2]      [,3]      [,4]
# [1,] 0.2655087 0.3721239 0.5728534 0.9082078
like image 120
coatless Avatar answered Oct 17 '22 10:10

coatless


The answer by @coatless is correct, but swarms you with detail you did not ask for.

At the same time, your question was under-specified as you gave no indication what you need the vectors for. With that caveat, I would say that

  • for simple use cases Rcpp is fine, and RcppArmadillo is equally fine
  • for use cases requiring linear algebra, prefer RcppArmadillo
  • performance will be largely equivalent, with the caveat that you want explicit 'call-by-reference' for RcppArmadillo as argued above
  • there is also a big difference between read-only vector access (say, a reduction like sum() or min() or lookup) and read-write access where it matters how you return the modified vector
  • all uses cases are generally much faster than R code, so in a first instance do not obsess over this.

Once you have it right, you can (and maybe should) profile.

like image 30
Dirk Eddelbuettel Avatar answered Oct 17 '22 08:10

Dirk Eddelbuettel