Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fast sorted sample with replacement

Tags:

performance

r

Basically, I want to do sort(sample(n, replace = TRUE)), for n = 1e6, and many times (e.g. 1e4 times).

Any way to do it faster in R(cpp)?

like image 732
F. Privé Avatar asked Jun 26 '19 18:06

F. Privé


2 Answers

Implementing @Frank's idea in Rcpp:

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
IntegerVector sort_sample(int n) {
  IntegerVector tab(n);
  for (int i = 0; i < n; i++) {
    int k = n * unif_rand();
    tab[k]++;
  }
  return tab;
}

Benchmark:

microbenchmark::microbenchmark(
  sort(sample(n, replace = TRUE)),
  tabulate(sample(n, replace = TRUE), n),
  sort_sample(n)
)

For n = 1e5:

Unit: microseconds
                                   expr      min       lq      mean    median        uq      max neval
        sort(sample(n, replace = TRUE)) 3214.726 3393.606 3667.1001 3507.0525 3716.3560 7007.411   100
 tabulate(sample(n, replace = TRUE), n) 1087.543 1104.215 1245.0722 1136.9085 1218.5865 4265.075   100
                         sort_sample(n)  789.403  812.780  870.2723  837.3445  924.4395 1188.432   100

For n = 1e6:

Unit: milliseconds
                                   expr      min       lq     mean   median       uq       max neval
        sort(sample(n, replace = TRUE)) 49.96651 52.58784 61.19775 55.09312 58.43035 160.16275   100
 tabulate(sample(n, replace = TRUE), n) 13.74391 14.44253 17.22742 15.99816 17.54367  48.72374   100
                         sort_sample(n) 12.80741 14.40371 17.98320 15.31699 17.53548  63.21692   100
like image 168
F. Privé Avatar answered Nov 14 '22 22:11

F. Privé


As mentioned by @Frank in the comments, it makes sense to use tabulate instead of sort. Once one uses that, it makes sense to think about faster sampling methods as provided by my dqrng package:

library(dqrng)
m <- 1e6
bm <- bench::mark(sort(sample.int(m, replace = TRUE)),
                  tabulate(sample.int(m, replace = TRUE)),
                  sort(dqsample.int(m, replace = TRUE)),
                  tabulate(dqsample.int(m, replace = TRUE)),
                  check = FALSE)
bm[, 1:4]
#> # A tibble: 4 x 4
#>   expression                                     min   median `itr/sec`
#>   <bch:expr>                                <bch:tm> <bch:tm>     <dbl>
#> 1 sort(sample.int(m, replace = TRUE))         72.3ms   75.5ms      13.2
#> 2 tabulate(sample.int(m, replace = TRUE))     22.8ms   27.7ms      34.6
#> 3 sort(dqsample.int(m, replace = TRUE))       59.5ms     64ms      15.3
#> 4 tabulate(dqsample.int(m, replace = TRUE))   14.4ms   16.3ms      57.0

Created on 2019-06-27 by the reprex package (v0.3.0)

Note that I am still using R 3.5 on this machine. With R 3.6, the difference between sample.int and dqsample.int would be larger. Note also that one no longer needs a development version of dqrng to get the fast sampling methods.

One can also use the RNGs from dqrng via C++, but that does not make much difference compared with tabulate(dqsample.int(...)) in my experience:

#include <Rcpp.h>
// [[Rcpp::depends(dqrng, sitmo)]]
#include <dqrng_generator.h>
#include <convert_seed.h>
#include <R_randgen.h>
// [[Rcpp::plugins(cpp11)]]

// [[Rcpp::export]]
Rcpp::IntegerVector sort_sample(uint32_t n) {
    Rcpp::IntegerVector seed(2, dqrng::R_random_int);
    auto rng = dqrng::generator(dqrng::convert_seed<uint64_t>(seed));
    Rcpp::IntegerVector tab(n);
    for (uint32_t i = 0; i < n; i++) {
        uint32_t k = (*rng)(n);
        tab[k]++;
    }
    return tab;
}
like image 31
Ralf Stubner Avatar answered Nov 14 '22 23:11

Ralf Stubner