Benchmarking "sample" function in R

Tags:

random

r

I was benchmarking the sample function in R and comparing it with igraph:sample_seq and ran into a strange result.

When I run something like:

library(microbenchmark)
library(igraph)
set.seed(1234)
N <- 55^4
M <- 500
(mbm <- microbenchmark(v1 = {sample(N,M)}, 
                       v2 = {igraph::sample_seq(1,N,M)}, times=50))

I get a result like this:

Unit: microseconds
 expr       min        lq        mean     median        uq       max neval
   v1 21551.475 22655.996 26966.22166 23748.2555 28340.974 47566.237    50
   v2    32.873    37.952    82.85238    81.7675    96.141   358.277    50

But when I run, for example,

set.seed(1234)
N <- 100^4
M <- 500
(mbm <- microbenchmark(v1 = {sample(N,M)}, 
                      v2 = {igraph::sample_seq(1,N,M)}, times=50))

I get a much faster result for sample:

Unit: microseconds
 expr    min     lq     mean  median     uq     max neval
   v1 52.165 55.636 64.70412 58.2395 78.636  88.120    50
   v2 39.174 43.504 62.09600 53.5715 73.253 176.419    50

It seems that when N is a power of 10 (or some other special number?), sample is much faster than other smaller N that are not powers of 10. Is this expected behavior or am I missing something?

982

asked Mar 04 '20 03:03

passerby51

1 Answers

sample() or rather sample.int() by default uses a hash algorithm when certain conditions are met, one being that n > 1e7.

If the second benchmark is rerun without hashing you'll see that it is also much slower than than the igraph function.

set.seed(1234)
N2 <- 100^4
M <- 500
(mbm <- microbenchmark(v1 = {sample.int(N2,M, useHash = FALSE)}, 
                       v2 = {igraph::sample_seq(1,N2,M)}, times=50))

Unit: microseconds
 expr        min         lq         mean     median         uq       max neval cld
   v1 144297.936 150368.649 167224.95664 154283.077 157832.520 407710.78    50   b
   v2     61.218     65.392     92.35544     87.885    118.262    148.87    50  a

From the documentation for the useHash argument:

logical indicating if the hash-version of the algorithm should be used. Can only be used for replace = FALSE, prob = NULL, and size <= n/2, and really should be used for large n, as useHash=FALSE will use memory proportional to n.

answered Oct 22 '22 19:10

Ritchie Sacramento

Related questions
                            
                                Assigning values in first rows of groups in a data.table
                            
                                How can I change the size of the strip on facets in a ggplot?
                            
                                R: Offer 5 seconds to demand a pause. If no pause demanded, resume the process
                            
                                Identifying specific differences between two data sets in R
                            
                                LinkedIn API: GET public profile from e-mail
                            
                                Add textbox to facet wrapped layout in ggplot2
                            
                                How is jitter determined in ggplot?
                            
                                Shiny: passing input$var to aes() in ggplot2
                            
                                R: apply a function to every element of two variables respectively
                            
                                R: Plot trees from h2o.randomForest() and h2o.gbm()
                            
                                Survival on binned data
                            
                                Set opacity of background map with ggmap
                            
                                How to rotate 180 degrees an mtext() in R
                            
                                RMarkdown button to show or hide code
                            
                                What are levels in R?
                            
                                How to load xlsx file using fread function?
                            
                                How to export tibble to .csv
                            
                                How to fix "Could not find shape for Powerpoint content Error: pandoc document conversion failed with error 63" Error
                            
                                Find array index of elements of a matrix that match a value in a vector of candidate values
                            
                                Mutating multiple columns dynamically while conditioning on specific rows

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With