Speeding up Rcpp `anyNA` equivalent

Question

This question is related to this old question and this old question.

R has the nice wrapper-ish function anyNA for quicker evaluation of any(is.na(x)). When working in Rcpp a similar minimal implementation could be given by:

// CharacterVector example
#include <Rcpp.h>
using namespace Rcpp;
template<typename T, typename S>
bool any_na(S x){
  T xx = as<T>(x);
  for(auto i : xx){
    if(T::is_na(i))
      return true;
  }
  return false;
}

// [[Rcpp::export(rng = false)]]
LogicalVector any_na(SEXP x){
  return any_na<CharacterVector>(x);
}

// [[Rcpp::export(rng = false)]]
SEXP overhead(SEXP x){
  CharacterVector xx = as<CharacterVector>(x);
  return wrap(xx);
}
/***R

library(microbenchmark)
vec <- sample(letters, 1e6, TRUE)
vec[1e6] <- NA_character_
any_na(vec)
# [1] TRUE
*/

But comparing the performance of this to anyNA I was surprised by the benchmark below

library(microbenchmark)
microbenchmark(
  Rcpp = any_na(vec), 
  R = anyNA(vec),
  overhead = overhead(vec), 
  unit = "ms"
)
Unit: milliseconds
     expr      min        lq     mean    median       uq      max neval cld
     Rcpp 2.647901 2.8059500 3.243573 3.0435010 3.675051 5.899100   100   c
        R 0.800300 0.8151005 0.952301 0.8577015 0.961201 3.467402   100  b 
 overhead 0.001300 0.0029010 0.011388 0.0122510 0.015751 0.048401   100 a

where the last line is the "overhead" incurred from converting back and forth from SEXP to CharacterVector (turns out to be negligible). As immediately evident the Rcpp version is roughly ~3.5 times slower than the R version. I was curious so I checked up on the source for Rcpp's is_na and finding no obvious reasons for the slow performance I continued to check the source for anyNA for R's own character vectors's and reimplementing the function using R's C API thinking to speed up this

// Added after SEXP overhead(SEXP x){ --- }
inline bool anyNA2(SEXP x){
  R_xlen_t n = Rf_length(x);
  for(R_xlen_t i = 0; i < n; i++){
    if(STRING_ELT(x, i) == NA_STRING)
      return true;
  }
  return false;
}
// [[Rcpp::export(rng = false)]]
SEXP any_na2(SEXP x){
  bool xx = anyNA2(x);
  return wrap(xx);
}
// [[Rcpp::export(rng = false)]]
SEXP any_na3(SEXP x){
  Function anyNA("anyNA");
  return anyNA(x);
}
/***R
microbenchmark(
  Rcpp = any_na(vec), 
  R = anyNA(vec),
  R_C_api = any_na2(vec),
  Rcpp_Function = any_na3(vec),
  overhead = overhead(vec),
  unit = "ms"
)
# Unit: milliseconds
# expr      min        lq       mean    median       uq      max neval  cld
# Rcpp 2.654901 2.8650515 3.54936501 3.2392510 3.997901 8.074201   100    d
# R 0.803701 0.8303015 1.01017200 0.9400015 1.061751 2.019902   100  b
# R_C_api 2.336402 2.4536510 3.01576302 2.7220010 3.314951 6.905101   100   c
# Rcpp_Function 0.844001 0.8862510 1.09259990 0.9597505 1.120701 3.011801   100  b
# overhead 0.001500 0.0071005 0.01459391 0.0146510 0.017651 0.101401   100 a
*/

Note that I've included a simple wrapper calling anyNA through Rcpp::Function as well. Once again this implementation of anyNA is not just a little but alot slower than the base implementation.

So the question becomes 2 fold:

Why is the Rcpp so much slower?
Derived from 1: How could this be "changed" to speed up the code?

The questions themselves are not very interesting in itself, but it is interesting if this is affecting multiple parts of Rcpp implementations that may in aggregate gain significant performance boosts.

SessonInfo()

sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)

Matrix products: default

locale:
[1] LC_COLLATE=English_Denmark.1252  LC_CTYPE=English_Denmark.1252    LC_MONETARY=English_Denmark.1252 LC_NUMERIC=C                     LC_TIME=English_Denmark.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] microbenchmark_1.4-7    cmdline.arguments_0.0.1 glue_1.4.2              R6_2.5.0                Rcpp_1.0.6             

loaded via a namespace (and not attached):
 [1] codetools_0.2-18 lattice_0.20-41  mvtnorm_1.1-1    zoo_1.8-8        MASS_7.3-53      grid_4.0.3       multcomp_1.4-15  Matrix_1.2-18    sandwich_3.0-0   splines_4.0.3   
[11] TH.data_1.0-10   tools_4.0.3      survival_3.2-7   compiler_4.0.3

Edit (Not only a windows problem):

I wanted to make sure this is not a "Windows problem" so I went through and executed the problem within a Docker container running linux. The result is shown below and is very similar

# Unit: milliseconds
#           expr    min      lq     mean  median      uq     max neval
#           Rcpp 2.3399 2.62155 4.093380 3.12495 3.92155 26.2088   100
#              R 0.7635 0.84415 1.459659 1.10350 1.42145 12.1148   100
#        R_C_api 2.3358 2.56500 3.833955 3.11075 3.65925 14.2267   100
#  Rcpp_Function 0.8163 0.96595 1.574403 1.27335 1.56730 11.9240   100
#       overhead 0.0009 0.00530 0.013330 0.01195 0.01660  0.0824   100

Session info:

sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04 LTS

Matrix products: default
BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-openmp/libopenblasp-r0.3.8.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=C
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] microbenchmark_1.4-7 Rcpp_1.0.5

loaded via a namespace (and not attached):
[1] compiler_4.0.2 tools_4.0.2

thc · Accepted Answer

This is an interesting question, but the answer is pretty simple: there are two versions of STRING_ELT one used internally by R or if you set the USE_RINTERNALS macro in Rinlinedfuns.h and one for plebs in memory.c.

Comparing the two versions, you can see that the pleb version has more checks, which fully accounts for the difference in speed.

If you really want speed and don't care about safety, you can usually beat R by at least a little bit.

// [[Rcpp::export(rng = false)]]
bool any_na_unsafe(SEXP x) {
  SEXP* ptr = STRING_PTR(x);
  R_xlen_t n = Rf_xlength(x);
  for(R_xlen_t i=0; i<n; ++i) {
    if(ptr[i] == NA_STRING) return true;
  }
  return false;
}

Bench:

> microbenchmark(
+   R = anyNA(vec),
+   R_C_api = any_na2(vec),
+   unsafe = any_na_unsafe(vec),
+   unit = "ms"
+ )
Unit: milliseconds
    expr    min      lq     mean  median      uq     max neval
       R 0.5058 0.52830 0.553696 0.54000 0.55465  0.7758   100
 R_C_api 1.9990 2.05170 2.214136 2.06695 2.10220 12.2183   100
  unsafe 0.3170 0.33135 0.369585 0.35270 0.37730  1.2856   100

Although as written this is unsafe, if you add a few checks before the loop in the beginning it'd be fine.

Dirk Eddelbuettel · Answer

This questions turns out to be a good example of why some people rail and rant against microbenchmarks.

Baseline is a built-in primitive

The function that is supposed to be beat here is actually a primitive so that makes it a little tricky already

> anyNA
function (x, recursive = FALSE)  .Primitive("anyNA")
>

ALTREP puts a performance floor down

Next, a little experiment shows that the baseline function anyNA() never loops. We define a very short vector srt and a long vector lng, both contain a NA value. Turns out ... R is optimised via ALTREP keeping a matching bit in the data structure headers and the cost of checking is independent of length:

> srt <- c("A",NA_character_); lng <- c(rep("A", 1e6), NA_character_)
> microbenchmark(short=function(srt) { anyNA(srt) }, 
+                long=function(lng) { anyNA(lng) }, times=1000)
Unit: nanoseconds
  expr min lq   mean median uq   max neval cld
 short  48 50 69.324     51 53  5293  1000   a
  long  48 50 92.166     51 52 15494  1000   a
>

Note the units here (nanoseconds) and time spent. We are measuring looking at single bit.

(Edit: Scrab that. Thinko of mine in a rush, see comments.)

Rcpp functions have some small overhead

This is not new and documented. If you look at the code generated by Rcpp Attributes, conveniently giving us an R function of the same name of the C++ function we designate you see that at least one other function call is involved. Plus a baked-in try/catch layer, RNG setting (here turned off) and so on. That cannot be zero, and if amortized against anything reasonable it does neither matter not show up in measurements.

Here, however, the exercise was set up to match a primitive function looking at one bit. It's a race one cannot win. So here is my final table

> microbenchmark(anyNA = anyNA(vec), Rcpp_plain = rcpp_c_api(vec), 
+     Rcpp_tmpl = rcpp_any_na(vec), Rcpp_altrep = rcpp_altrep(vec), 
+     times = .... [TRUNCATED] 
Unit: microseconds
        expr      min      lq     mean   median      uq      max neval  cld
       anyNA  643.993  658.43  827.773  700.729  819.78  6280.85  5000 a   
  Rcpp_plain 1916.188 1952.55 2168.708 2022.017 2191.64  8506.71  5000    d
   Rcpp_tmpl 1709.380 1743.04 1933.043 1798.788 1947.83  8176.10  5000   c 
 Rcpp_altrep 1501.148 1533.88 1741.465 1590.572 1744.74 10584.93  5000  b

It contains the primitive R function, the original (templated) C++ function which looks pretty good still, something using Rcpp (and its small overhead) with just C API use (plus the automatic wrappers in/out) a little slower -- and then for comparison a function from Michel's checkmate package which does look at the ALTREP bit. And it is barely faster.

So really what we are looking at here is overhead from function calls getting in the way of measurning a micro-operations. So no, Rcpp cannot be made faster than a highly optimised primitive. The question looked interesting, but was, at the end of the day, somewhat ill-posed. Sometimes it is worth working through that.

My code version follows below.

// CharacterVector example
#include <Rcpp.h>
using namespace Rcpp;
template<typename T, typename S>
bool any_na(S x){
    T xx = as<T>(x);
    for (auto i : xx){
        if (T::is_na(i))
            return true;
    }
    return false;
}

// [[Rcpp::export(rng = false)]]
LogicalVector rcpp_any_na(SEXP x){
    return any_na<CharacterVector>(x);
}

// [[Rcpp::export(rng = false)]]
SEXP overhead(SEXP x){
    CharacterVector xx = as<CharacterVector>(x);
    return wrap(xx);
}

// [[Rcpp::export(rng = false)]]
bool rcpp_c_api(SEXP x) {
    R_xlen_t n = Rf_length(x);
    for (R_xlen_t i = 0; i < n; i++) {
        if(STRING_ELT(x, i) == NA_STRING)
            return true;
  }
  return false;
}

// [[Rcpp::export(rng = false)]]
SEXP any_na3(SEXP x){
  Function anyNA("anyNA");
  return anyNA(x);
}

// courtesy of the checkmate package
// [[Rcpp::export(rng=false)]]
R_xlen_t rcpp_altrep(SEXP x) {
#if defined(R_VERSION) && R_VERSION >= R_Version(3, 5, 0)
    if (STRING_NO_NA(x))
        return 0;
#endif
    const R_xlen_t nx = Rf_xlength(x);
    for (R_xlen_t i = 0; i < nx; i++) {
        if (STRING_ELT(x, i) == NA_STRING)
            return i + 1;
    }
    return 0;
}


/***R
library(microbenchmark)

srt <- c("A",NA_character_)
lng <- c(rep("A", 1e6), NA_character_)
microbenchmark(short = function(srt) { anyNA(srt) },
               long = function(lng) { anyNA(lng) },
               times=1000)

N <- 1e6
vec <- sample(letters, N, TRUE)
vec[N] <- NA_character_
anyNA(vec)                      # to check


microbenchmark(
  anyNA       = anyNA(vec),
  Rcpp_plain  = rcpp_c_api(vec),
  Rcpp_tmpl   = rcpp_any_na(vec),
  Rcpp_altrep = rcpp_altrep(vec),
  #Rcpp_Function = any_na3(vec),
  #overhead = overhead(vec),
  times = 5000
#  unit="relative"
)
*/

Speeding up Rcpp `anyNA` equivalent

Tags:

r

rcpp

SessonInfo()

Edit (Not only a windows problem):

Oliver

2 Answers

thc

Baseline is a built-in primitive

ALTREP puts a performance floor down

Rcpp functions have some small overhead

Dirk Eddelbuettel

Recent Activity

Donate For Us

Speeding up Rcpp `anyNA` equivalent

Tags:

r

rcpp

SessonInfo()

Edit (Not only a windows problem):

Oliver

2 Answers

thc

Baseline is a built-in primitive

ALTREP puts a performance floor down

Rcpp functions have some small overhead

Dirk Eddelbuettel

Related questions

Recent Activity

Donate For Us