Fast checking of missing values in Rcpp

Tags:

performance

r

rcpp

This question is linked to "NA values in Rcpp conditional".

I have some Rcpp code that loops over multiple (double) elements, and I need to check each element for a missing value (I can't use vectorization). As a minimal reproducible example, let's count the number of missing values in a vector:

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
int nb_na(const NumericVector& x) {
  int n = x.size();
  int c = 0;
  for (int i = 0; i < n; i++) if (R_IsNA(x[i])) c++;
  return c;
}

// [[Rcpp::export]]
int nb_na3(const NumericVector& x) {
  int n = x.size();
  int c = 0;
  for (int i = 0; i < n; i++) if (x[i] == 3) c++;
  return c;
}

// [[Rcpp::export]]
LogicalVector na_real(NumericVector x) {
  return x == NA_REAL;
}

Then, in R, we get:

> x <- rep(c(1, 2, NA), 1e4)

> x2 <- replace(x, is.na(x), 3)

> microbenchmark::microbenchmark(
+   nb_na(x),
+   nb_na3(x2)
+ )
Unit: microseconds
       expr     min      lq      mean  median       uq      max neval
   nb_na(x) 135.633 135.982 153.08586 139.753 140.3115 1294.928   100
 nb_na3(x2)  22.490  22.908  30.14005  23.188  23.5025  684.026   100

> all.equal(nb_na(x), nb_na3(x2))
[1] TRUE

> na_real(x[1:3])
[1] NA NA NA

As noted in the linked question, you can't just check x[i] == NA_REAL because it always returns a missing value. Yet using R_IsNA(x[i]) is much slower than checking equality with a numeric value (e.g. 3).

Basically, I want a way to check whether a single value is missing that is as fast as checking equality with a numeric value.


asked Oct 23 '17 14:10

F. Privé

2 Answers

Checking for a missing value, or for any specific NaN variant, is always going to be more expensive than checking for a specific value. That's just floating point arithmetic.
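
To make the floating point remark concrete, here is a minimal plain C++ sketch (no Rcpp; the function names are made up for illustration). R's NA for doubles is a NaN, and a NaN never compares equal to anything, itself included, so an equality test cannot find it and a dedicated NaN test is needed:

```cpp
#include <cmath>
#include <limits>

// Hypothetical helper: true when x compares equal to the given target.
bool equals_target(double x, double target) { return x == target; }

// Hypothetical helper: a NaN is detected with std::isnan, not with ==.
bool is_nan_value(double x) { return std::isnan(x); }

// A quiet NaN to experiment with (a stand-in, not R's exact NA payload).
double some_nan() { return std::numeric_limits<double>::quiet_NaN(); }
```

Comparing some_nan() with itself through equals_target is false, while the same comparison on an ordinary value such as 3.0 is true. Note that this holds for default compiler settings; aggressive flags such as -ffast-math may change NaN handling.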

However, there's still room for improvement in your code. I would encourage you to use NumericVector::is_na instead of R_IsNA, but this is mostly cosmetic.

Branching can also be expensive, so I'd replace if (R_IsNA(x[i])) c++; with c += NumericVector::is_na(x[i]). This gives this version:

// [[Rcpp::export]]
int nb_na4(const NumericVector& x) {
  int n = x.size();
  int c = 0;
  for (int i = 0; i < n; i++) c += NumericVector::is_na(x[i]) ;
  return c;
}

Then, iterating over an int index and accessing x[i] can be replaced by the std::count_if algorithm. This is its raison d'être, and it leads to this version:

// [[Rcpp::export]]
int nb_na5(const NumericVector& x) {
  return std::count_if(x.begin(), x.end(), NumericVector::is_na ) ;
}
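
The same idea outside Rcpp, as a rough plain C++ sketch. The names count_nan and count_nan_demo are hypothetical, and std::isnan stands in for NumericVector::is_na, so this counts every NaN rather than only NA. The lambda is there because std::isnan is overloaded and generally cannot be handed to std::count_if by name:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Count NaN entries with std::count_if and a small predicate.
std::size_t count_nan(const std::vector<double>& x) {
  return static_cast<std::size_t>(
    std::count_if(x.begin(), x.end(), [](double v) { return std::isnan(v); }));
}

// Small demo input mirroring rep(c(1, 2, NA), 2) from the question.
std::size_t count_nan_demo() {
  const double nan = std::numeric_limits<double>::quiet_NaN();
  std::vector<double> x = {1.0, 2.0, nan, 1.0, 2.0, nan};
  return count_nan(x);
}
```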

Now, if the performance is still not good enough, you might want to try parallelization. For this I typically use the tbb library from the RcppParallel package.

// [[Rcpp::export]]
int nb_na6(const NumericVector& x) {
  return tbb::parallel_reduce( 
    tbb::blocked_range<const double*>(x.begin(), x.end()),
    0, 
    [](const tbb::blocked_range<const double*>& r, int init) -> int {
      return init + std::count_if( r.begin(), r.end(), NumericVector::is_na );
    }, 
    []( int x, int y){ return x+y; }
  ) ;
}

Benchmarking with this function:

library(microbenchmark)

bench <- function(n){
  x <- rep(c(1, 2, NA), n)
  microbenchmark(
    nb_na = nb_na(x), 
    nb_na4 = nb_na4(x), 
    nb_na5 = nb_na5(x), 
    nb_na6 = nb_na6(x)
  )
}
bench(1e5)

On my machine I get:

> bench(1e4)
Unit: microseconds
   expr    min      lq      mean  median       uq     max neval  cld
  nb_na 84.358 94.6500 107.41957 110.482 118.9580 137.393   100    d
 nb_na4 59.984 69.4925  79.42195  82.442  85.9175 106.567   100  b
 nb_na5 65.047 75.2625  85.17134  87.501  93.0315 116.993   100   c
 nb_na6 39.205 51.0785  59.20582  54.457  68.9625  97.225   100 a

> bench(1e5)
Unit: microseconds
   expr     min       lq     mean   median       uq      max neval  cld
  nb_na 730.416 732.2660 829.8440 797.4350 872.3335 1410.467   100    d
 nb_na4 520.800 521.6215 598.8783 562.7200 657.1755 1059.991   100  b
 nb_na5 578.527 579.3805 664.8795 626.5530 710.5925 1166.365   100   c
 nb_na6 294.486 345.2050 368.6664 353.6945 372.6205  897.552   100 a

Another way is to circumvent floating point arithmetic entirely: pretend the vector is a vector of long long (64-bit integers) and compare the values to the bit pattern of NA_REAL:

  > devtools::install_github( "ThinkR-open/seven31" )
  > seven31::reveal(NA, NaN, +Inf, -Inf )
  0 11111111111 ( NaN ) 0000000000000000000000000000000000000000011110100010 : NA
  0 11111111111 ( NaN ) 1000000000000000000000000000000000000000000000000000 : NaN
  0 11111111111 ( NaN ) 0000000000000000000000000000000000000000000000000000 : +Inf
  1 11111111111 ( NaN ) 0000000000000000000000000000000000000000000000000000 : -Inf
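
Read as a 64-bit integer, the first row is 0x7FF00000000007A2: sign 0, all eleven exponent bits set, and a fraction whose low bits spell 1954 (0x7A2). Here is a small plain C++ sketch of how such bit patterns can be inspected with std::memcpy. The names kNaBits and bits_of are mine, not part of R or Rcpp, and the constant is simply my transcription of the output above:

```cpp
#include <cstdint>
#include <cstring>
#include <limits>

// Assumed value, transcribed from the seven31 output for NA.
constexpr std::uint64_t kNaBits = 0x7FF00000000007A2ULL;

// Copy the 8 bytes of a double into an integer without pointer casts.
std::uint64_t bits_of(double x) {
  static_assert(sizeof(double) == sizeof(std::uint64_t),
                "this sketch expects 64-bit doubles");
  std::uint64_t b;
  std::memcpy(&b, &x, sizeof b);
  return b;
}
```

With this helper, bits_of applied to +Inf and -Inf gives 0x7FF0000000000000 and 0xFFF0000000000000, matching the last two rows of the output.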

A serial solution using this hack:

// [[Rcpp::export]]
int nb_na7( const NumericVector& x){
  const long long* p = reinterpret_cast<const long long*>(x.begin()) ;
  long long na = *reinterpret_cast<long long*>(&NA_REAL) ;

  return std::count(p, p + x.size(), na ) ;

}
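
Strictly speaking, reading doubles through a long long pointer is not something the C++ standard guarantees to work (it runs against the aliasing rules), even though it behaves as intended with common compilers. A hedged alternative sketch, in plain C++ with hypothetical names, copies each value's bytes with std::memcpy instead. Compilers usually turn that into a plain load, but this variant is not part of the benchmarks in this answer:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Count the elements whose raw 64-bit representation equals `pattern`.
std::size_t count_bit_pattern(const double* x, std::size_t n,
                              std::uint64_t pattern) {
  std::size_t c = 0;
  for (std::size_t i = 0; i < n; ++i) {
    std::uint64_t b;
    std::memcpy(&b, x + i, sizeof b);  // well-defined way to read the bits
    c += (b == pattern);               // branch-free increment
  }
  return c;
}

// Demo: two slots are overwritten with the assumed NA bit pattern.
std::size_t count_bit_pattern_demo() {
  const std::uint64_t na = 0x7FF00000000007A2ULL;  // assumed NA pattern
  double x[6] = {1.0, 2.0, 0.0, 1.0, 2.0, 0.0};
  std::memcpy(&x[2], &na, sizeof na);
  std::memcpy(&x[5], &na, sizeof na);
  return count_bit_pattern(x, 6, na);
}
```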

And then a parallel version:

// [[Rcpp::export]]
int nb_na8( const NumericVector& x){
  const long long* p = reinterpret_cast<const long long*>(x.begin()) ;
  long long na = *reinterpret_cast<long long*>(&NA_REAL) ;

  auto count_chunk = [=](const tbb::blocked_range<const long long*>& r, int init) -> int {
    return init + std::count( r.begin(), r.end(), na);
  } ;

  return tbb::parallel_reduce( 
    tbb::blocked_range<const long long*>(p, p + x.size()),
    0, 
    count_chunk, 
    []( int x, int y){ return x+y; }
  ) ;

}

  > bench(1e5)
  Unit: microseconds
     expr     min       lq     mean   median       uq      max neval    cld
    nb_na 730.346 762.5720 839.9479 857.5865 881.8635 1045.048   100      f
   nb_na4 520.946 521.6850 589.0911 578.2825 653.4950  832.449   100    d  
   nb_na5 578.621 579.3245 640.9772 616.8645 701.8125  890.736   100     e 
   nb_na6 291.115 307.4300 340.1626 344.7955 360.7030  484.261   100   c   
   nb_na7 122.156 123.4990 141.1954 132.6385 149.7895  253.988   100  b    
   nb_na8  69.356  86.9980 109.6427 115.2865 126.2775  182.184   100 a     

  > bench(1e6)
  Unit: microseconds
     expr      min        lq      mean    median        uq      max neval  cld
    nb_na 7342.984 7956.3375 10261.583 9227.7450 10869.605 79757.09   100    d
   nb_na4 5286.970 5721.9150  7659.009 6660.2390  9234.646 31141.47   100   c 
   nb_na5 5840.946 6272.7050  7307.055 6883.2430  8205.117 10420.48   100   c 
   nb_na6 2833.378 2895.7160  3891.745 3049.4160  4054.022 18242.26   100  b  
   nb_na7 1661.421 1791.1085  2708.992 1916.6055  2232.720 60827.63   100 ab  
   nb_na8  650.639  869.6685  1289.373  939.0045  1291.025 10223.29   100 a

This assumes there's only one bit pattern to represent NA.
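
That assumption can fail: a NaN carries payload bits, and the hardware may change some of them (for instance by setting the "quiet" bit) while R still treats the value as NA. As far as I understand R's own R_IsNA, it tests for a NaN whose lower 32 bits equal 1954 rather than for one exact 64-bit pattern. A sketch of that looser test on raw bits, in plain C++ with a hypothetical name:

```cpp
#include <cstdint>

// True when the bits describe a NaN whose low 32 bits are 1954.
// The sign bit and the remaining payload bits are deliberately ignored.
bool is_na_bits(std::uint64_t b) {
  const std::uint64_t exponent_mask = 0x7FF0000000000000ULL;
  const bool exponent_all_ones = (b & exponent_mask) == exponent_mask;
  const bool low_word_is_1954 = (b & 0xFFFFFFFFULL) == 1954u;
  // A nonzero low word implies a nonzero fraction, so this is a NaN, not Inf.
  return exponent_all_ones && low_word_is_1954;
}
```

This accepts both 0x7FF00000000007A2 and the quieted 0x7FF80000000007A2, and rejects an ordinary NaN such as 0x7FF8000000000000. It costs a couple of extra mask operations per element compared with a single integer comparison.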

Here's my entire file for reference:

#include <Rcpp.h>
#include <RcppParallel.h>

// [[Rcpp::depends(RcppParallel)]]
// [[Rcpp::plugins(cpp11)]]
using namespace Rcpp;

// [[Rcpp::export]]
int nb_na(const NumericVector& x) {
  int n = x.size();
  int c = 0;
  for (int i = 0; i < n; i++) if (R_IsNA(x[i])) c++;
  return c;
}

// [[Rcpp::export]]
int nb_na4(const NumericVector& x) {
  int n = x.size();
  int c = 0;
  for (int i = 0; i < n; i++) c += NumericVector::is_na(x[i]) ;
  return c;
}

// [[Rcpp::export]]
int nb_na5(const NumericVector& x) {
  return std::count_if(x.begin(), x.end(), NumericVector::is_na ) ;
}

// [[Rcpp::export]]
int nb_na6(const NumericVector& x) {
  return tbb::parallel_reduce( 
    tbb::blocked_range<const double*>(x.begin(), x.end()),
    0, 
    [](const tbb::blocked_range<const double*>& r, int init) -> int {
      return init + std::count_if( r.begin(), r.end(), NumericVector::is_na );
    }, 
    []( int x, int y){ return x+y; }
  ) ;
}

// [[Rcpp::export]]
int nb_na7( const NumericVector& x){
  const long long* p = reinterpret_cast<const long long*>(x.begin()) ;
  long long na = *reinterpret_cast<long long*>(&NA_REAL) ;

  return std::count(p, p + x.size(), na ) ;

}

// [[Rcpp::export]]
int nb_na8( const NumericVector& x){
  const long long* p = reinterpret_cast<const long long*>(x.begin()) ;
  long long na = *reinterpret_cast<long long*>(&NA_REAL) ;

  auto count_chunk = [=](const tbb::blocked_range<const long long*>& r, int init) -> int {
    return init + std::count( r.begin(), r.end(), na);
  } ;

  return tbb::parallel_reduce( 
    tbb::blocked_range<const long long*>(p, p + x.size()),
    0, 
    count_chunk, 
    []( int x, int y){ return x+y; }
  ) ;

}

/*** R
library(microbenchmark)

bench <- function(n){
  x <- rep(c(1, 2, NA), n)
  microbenchmark(
    nb_na = nb_na(x), 
    nb_na4 = nb_na4(x), 
    nb_na5 = nb_na5(x), 
    nb_na6 = nb_na6(x), 
    nb_na7 = nb_na7(x), 
    nb_na8 = nb_na8(x)
  )
}
bench(1e5)
bench(1e6)
*/


answered Sep 30 '22 18:09

Romain Francois

Checking for (IEEE) missing floating-point values is an expensive operation and there is no way around it. This is unrelated to R.

This is one reason why we're excited about the upcoming ALTREP in R: there we can, for instance, keep track of whether a double/real vector contains missing values or not. If it doesn't, then we don't have to waste time looking for them. Although it has not been updated to mention ALTREP, you can get the gist from https://github.com/HenrikBengtsson/Wishlist-for-R/issues/12

answered Sep 30 '22 17:09

HenrikB
