Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Benchmarking "sample" function in R

Tags:

random

r

I was benchmarking the sample function in R and comparing it with igraph:sample_seq and ran into a strange result.

When I run something like:

library(microbenchmark)
library(igraph)
set.seed(1234)
N <- 55^4
M <- 500
(mbm <- microbenchmark(v1 = {sample(N,M)}, 
                       v2 = {igraph::sample_seq(1,N,M)}, times=50))

I get a result like this:

Unit: microseconds
 expr       min        lq        mean     median        uq       max neval
   v1 21551.475 22655.996 26966.22166 23748.2555 28340.974 47566.237    50
   v2    32.873    37.952    82.85238    81.7675    96.141   358.277    50

But when I run, for example,

set.seed(1234)
N <- 100^4
M <- 500
(mbm <- microbenchmark(v1 = {sample(N,M)}, 
                      v2 = {igraph::sample_seq(1,N,M)}, times=50))

I get a much faster result for sample:

Unit: microseconds
 expr    min     lq     mean  median     uq     max neval
   v1 52.165 55.636 64.70412 58.2395 78.636  88.120    50
   v2 39.174 43.504 62.09600 53.5715 73.253 176.419    50

It seems that when N is a power of 10 (or some other special number?), sample is much faster than other smaller N that are not powers of 10. Is this expected behavior or am I missing something?

like image 982
passerby51 Avatar asked Mar 04 '20 03:03

passerby51


People also ask

What is the Microbenchmark package useful for?

The microbenchmark package is useful for running small sections of code to assess performance, as well as for comparing the speed of several functions that do the same thing.


1 Answers

sample() or rather sample.int() by default uses a hash algorithm when certain conditions are met, one being that n > 1e7.

If the second benchmark is rerun without hashing you'll see that it is also much slower than than the igraph function.

set.seed(1234)
N2 <- 100^4
M <- 500
(mbm <- microbenchmark(v1 = {sample.int(N2,M, useHash = FALSE)}, 
                       v2 = {igraph::sample_seq(1,N2,M)}, times=50))

Unit: microseconds
 expr        min         lq         mean     median         uq       max neval cld
   v1 144297.936 150368.649 167224.95664 154283.077 157832.520 407710.78    50   b
   v2     61.218     65.392     92.35544     87.885    118.262    148.87    50  a 

From the documentation for the useHash argument:

logical indicating if the hash-version of the algorithm should be used. Can only be used for replace = FALSE, prob = NULL, and size <= n/2, and really should be used for large n, as useHash=FALSE will use memory proportional to n.

like image 59
Ritchie Sacramento Avatar answered Oct 22 '22 19:10

Ritchie Sacramento