This question led to a new R package: wrswoR
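(For reference, a minimal usage sketch; it assumes the package exports sample_int_rej(n, size, prob) for rejection-based weighted sampling without replacement — check the package documentation for the current interface.)

# Hedged sketch: assumes wrswoR provides sample_int_rej(n, size, prob),
# drawing `size` distinct indices from 1:n with weights `prob`.
# install.packages("wrswoR")
library(wrswoR)
set.seed(42)
p <- runif(10)
sample_int_rej(10, 5, p)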
R's default sampling without replacement using sample.int seems to require quadratic run time, e.g. when using weights drawn from a uniform distribution. This is slow for large sample sizes. Does anybody know a faster implementation that would be usable from within R? Two options are "rejection sampling with replacement" (see this question on stats.sx) and the algorithm by Wong and Easton (1980) (with a Python implementation in a StackOverflow answer).

Thanks to Ben Bolker for hinting at the C function that is called internally when sample.int is called with replace=F and non-uniform weights: ProbSampleNoReplace. Indeed, the code shows two nested for loops (line 420 ff. of random.c).
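For intuition, here is a rough pure-R sketch of that nested-loop structure (an illustration only, not the actual C code): every draw rescans the remaining probability mass, so drawing k out of n items costs on the order of n * k operations.

# Pure-R illustration of the ProbSampleNoReplace-style nested loops (sketch)
prob_sample_no_replace <- function(n, k, prob) {
  p <- prob / sum(prob)
  idx <- seq_len(n)
  out <- integer(k)
  for (i in seq_len(k)) {              # outer loop: one pass per draw
    u <- runif(1) * sum(p)
    acc <- p[1]
    j <- 1
    while (acc < u && j < length(p)) { # inner loop: linear scan over remaining mass
      j <- j + 1
      acc <- acc + p[j]
    }
    out[i] <- idx[j]
    idx <- idx[-j]                     # drop the selected unit and its weight
    p <- p[-j]
  }
  out
}

prob_sample_no_replace(10, 5, runif(10))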
Here's the code to analyze the run time empirically:
library(plyr)

sample.int.test <- function(n, p) {
  sample.int(2 * n, n, replace=F, prob=p); NULL
}

times <- ldply(
  1:7,
  function(i) {
    n <- 1024 * (2 ** i)
    p <- runif(2 * n)
    data.frame(
      n=n,
      user=system.time(sample.int.test(n, p), gcFirst=T)['user.self'])
  },
  .progress='text'
)

times

library(ggplot2)
ggplot(times, aes(x=n, y=user/n)) + geom_point() + scale_x_log10() +
  ylab('Time per unit (s)')

# Output:
       n   user
1   2048  0.008
2   4096  0.028
3   8192  0.100
4  16384  0.408
5  32768  1.645
6  65536  6.604
7 131072 26.558
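To make the "quadratic" claim a bit more concrete, one can fit a power law to these timings; the log-log slope comes out close to 2 (a quick sketch reusing the times data frame from above):

# Sketch: estimate the scaling exponent; a log(n) coefficient near 2
# indicates roughly quadratic growth of the run time
fit <- lm(log(user) ~ log(n), data = times)
coef(fit)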
EDIT: Thanks to Arun for pointing out that unweighted sampling doesn't seem to have this performance penalty.
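A quick sketch to verify this on your own machine: the same call is near-instant without prob, and slow only on the weighted code path.

# Sketch: unweighted sampling without replacement is fast,
# the weighted call is the slow one (tens of seconds per the table above)
n <- 131072
system.time(sample.int(2 * n, n, replace = FALSE))                       # fast
system.time(sample.int(2 * n, n, replace = FALSE, prob = runif(2 * n)))  # slow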
Update:

An Rcpp implementation of the Efraimidis & Spirakis algorithm (thanks to @Hemmo, @Dinrem, @krlmlr and @rtlgrmpf):
library(inline)
library(Rcpp)

src <- '
int num = as<int>(size), x = as<int>(n);
Rcpp::NumericVector vx = Rcpp::clone<Rcpp::NumericVector>(x);
Rcpp::NumericVector pr = Rcpp::clone<Rcpp::NumericVector>(prob);
Rcpp::NumericVector rnd = rexp(x) / pr;
for(int i = 0; i < vx.size(); ++i) vx[i] = i;
std::partial_sort(vx.begin(), vx.begin() + num, vx.end(), Comp(rnd));
vx = vx[seq(0, num - 1)] + 1;
return vx;
'

incl <- '
struct Comp{
  Comp(const Rcpp::NumericVector& v ) : _v(v) {}
  bool operator ()(int a, int b) { return _v[a] < _v[b]; }
  const Rcpp::NumericVector& _v;
};
'

funFast <- cxxfunction(signature(n = "Numeric", size = "integer", prob = "numeric"),
                       src, plugin = "Rcpp", include = incl)

# See the bottom of the answer for comparison
p <- c(995/1000, rep(1/1000, 5))
n <- 100000
system.time(print(table(replicate(funFast(6, 3, p), n = n)) / n))

      1       2       3       4       5       6
1.00000 0.39996 0.39969 0.39973 0.40180 0.39882
   user  system elapsed
   3.93    0.00    3.96

# In case of:
# Rcpp::IntegerVector vx = Rcpp::clone<Rcpp::IntegerVector>(x);
# i.e. instead of NumericVector

      1       2       3       4       5       6
1.00000 0.40150 0.39888 0.39925 0.40057 0.39980
   user  system elapsed
   1.93    0.00    2.03
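The benchmark code further below also calls weighted_Random_Sample (the "Reservoir" entry), which is defined in another answer to this question and not reproduced here. For reference, a minimal pure-R sketch of the same Efraimidis & Spirakis idea, using exponential keys; this is an illustration, not the actual weighted_Random_Sample code:

# Minimal pure-R sketch of Efraimidis & Spirakis weighted sampling
# without replacement: one exponential key per item, keep the smallest keys
es_sample <- function(x, prob, size) {
  key <- rexp(length(x)) / prob        # smaller key = higher priority
  x[order(key)[seq_len(size)]]
}

set.seed(1)
es_sample(1:10, runif(10), 5)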
Old version:
Let us try a few possible approaches:
Simple rejection sampling with replacement. This is a much simpler function than the sample.int.rej offered by @krlmlr, i.e. the sample size of each proposal is always equal to n. As we will see, it is still really fast assuming a uniform distribution for the weights, but extremely slow in the other situation.
fastSampleReject <- function(all, n, w){
  out <- numeric(0)
  while(length(out) < n)
    out <- unique(c(out, sample(all, n, replace = TRUE, prob = w)))
  out[1:n]
}
The algorithm by Wong and Easton (1980). Here is an R implementation of this Python version. It is stable, and I might be missing something, but it is much slower than the other functions.
fastSample1980 <- function(all, n, w){
  tws <- w
  for(i in (length(tws) - 1):0)
    tws[1 + i] <- sum(tws[1 + i], tws[1 + 2 * i + 1], tws[1 + 2 * i + 2],
                      na.rm = TRUE)
  out <- numeric(n)
  for(i in 1:n){
    gas <- tws[1] * runif(1)
    k <- 0
    while(gas > w[1 + k]){
      gas <- gas - w[1 + k]
      k <- 2 * k + 1
      if(gas > tws[1 + k]){
        gas <- gas - tws[1 + k]
        k <- k + 1
      }
    }
    wgh <- w[1 + k]
    out[i] <- all[1 + k]
    w[1 + k] <- 0
    while(1 + k >= 1){
      tws[1 + k] <- tws[1 + k] - wgh
      k <- floor((k - 1) / 2)
    }
  }
  out
}
An Rcpp implementation of the algorithm by Wong and Easton. It can possibly be optimized even more, since this is my first usable Rcpp function, but it already works well.
library(inline)
library(Rcpp)

src <- '
Rcpp::NumericVector weights = Rcpp::clone<Rcpp::NumericVector>(w);
Rcpp::NumericVector tws = Rcpp::clone<Rcpp::NumericVector>(w);
Rcpp::NumericVector x = Rcpp::NumericVector(all);
int k, num = as<int>(n);
Rcpp::NumericVector out(num);
double gas, wgh;

if((weights.size() - 1) % 2 == 0){
  tws[((weights.size()-1)/2)] += tws[weights.size()-1] + tws[weights.size()-2];
}
else
{
  tws[floor((weights.size() - 1)/2)] += tws[weights.size() - 1];
}

for (int i = (floor((weights.size() - 1)/2) - 1); i >= 0; i--){
  tws[i] += (tws[2 * i + 1]) + (tws[2 * i + 2]);
}

for(int i = 0; i < num; i++){
  gas = as<double>(runif(1)) * tws[0];
  k = 0;
  while(gas > weights[k]){
    gas -= weights[k];
    k = 2 * k + 1;
    if(gas > tws[k]){
      gas -= tws[k];
      k += 1;
    }
  }
  wgh = weights[k];
  out[i] = x[k];
  weights[k] = 0;
  while(k > 0){
    tws[k] -= wgh;
    k = floor((k - 1) / 2);
  }
  tws[0] -= wgh;
}
return out;
'

fun <- cxxfunction(signature(all = "numeric", n = "integer", w = "numeric"),
                   src, plugin = "Rcpp")
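A quick sanity check of the compiled function (a sketch, assuming the compilation above succeeded):

# Sketch: draw 5 distinct values from 1:10 with random weights
set.seed(1)
fun(1:10, 5L, runif(10))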
Now some results:
times1 <- ldply(
  1:6,
  function(i) {
    n <- 1024 * (2 ** i)
    p <- runif(2 * n) # Uniform distribution
    p <- p/sum(p)
    data.frame(
      n=n,
      user=c(system.time(sample.int.test(n, p), gcFirst=T)['user.self'],
             system.time(weighted_Random_Sample(1:(2*n), p, n), gcFirst=T)['user.self'],
             system.time(fun(1:(2*n), n, p), gcFirst=T)['user.self'],
             system.time(sample.int.rej(2*n, n, p), gcFirst=T)['user.self'],
             system.time(fastSampleReject(1:(2*n), n, p), gcFirst=T)['user.self'],
             system.time(fastSample1980(1:(2*n), n, p), gcFirst=T)['user.self']),
      id=c("Base", "Reservoir", "Rcpp", "Rejection", "Rejection simple", "1980"))
  },
  .progress='text'
)

times2 <- ldply(
  1:6,
  function(i) {
    n <- 1024 * (2 ** i)
    p <- runif(2 * n - 1)
    p <- p/sum(p)
    p <- c(0.999, 0.001 * p) # Special case
    data.frame(
      n=n,
      user=c(system.time(sample.int.test(n, p), gcFirst=T)['user.self'],
             system.time(weighted_Random_Sample(1:(2*n), p, n), gcFirst=T)['user.self'],
             system.time(fun(1:(2*n), n, p), gcFirst=T)['user.self'],
             system.time(sample.int.rej(2*n, n, p), gcFirst=T)['user.self'],
             system.time(fastSampleReject(1:(2*n), n, p), gcFirst=T)['user.self'],
             system.time(fastSample1980(1:(2*n), n, p), gcFirst=T)['user.self']),
      id=c("Base", "Reservoir", "Rcpp", "Rejection", "Rejection simple", "1980"))
  },
  .progress='text'
)
arrange(times1, id)
       n  user               id
1   2048  0.53             1980
2   4096  0.94             1980
3   8192  2.00             1980
4  16384  4.32             1980
5  32768  9.10             1980
6  65536 21.32             1980
7   2048  0.02             Base
8   4096  0.05             Base
9   8192  0.18             Base
10 16384  0.75             Base
11 32768  2.99             Base
12 65536 12.23             Base
13  2048  0.00             Rcpp
14  4096  0.01             Rcpp
15  8192  0.03             Rcpp
16 16384  0.07             Rcpp
17 32768  0.14             Rcpp
18 65536  0.31             Rcpp
19  2048  0.00        Rejection
20  4096  0.00        Rejection
21  8192  0.00        Rejection
22 16384  0.02        Rejection
23 32768  0.02        Rejection
24 65536  0.03        Rejection
25  2048  0.00 Rejection simple
26  4096  0.01 Rejection simple
27  8192  0.00 Rejection simple
28 16384  0.01 Rejection simple
29 32768  0.00 Rejection simple
30 65536  0.05 Rejection simple
31  2048  0.00        Reservoir
32  4096  0.00        Reservoir
33  8192  0.00        Reservoir
34 16384  0.02        Reservoir
35 32768  0.03        Reservoir
36 65536  0.05        Reservoir

arrange(times2, id)
       n  user               id
1   2048  0.43             1980
2   4096  0.93             1980
3   8192  2.00             1980
4  16384  4.36             1980
5  32768  9.08             1980
6  65536 19.34             1980
7   2048  0.01             Base
8   4096  0.04             Base
9   8192  0.18             Base
10 16384  0.75             Base
11 32768  3.11             Base
12 65536 12.04             Base
13  2048  0.01             Rcpp
14  4096  0.02             Rcpp
15  8192  0.03             Rcpp
16 16384  0.08             Rcpp
17 32768  0.15             Rcpp
18 65536  0.33             Rcpp
19  2048  0.00        Rejection
20  4096  0.00        Rejection
21  8192  0.02        Rejection
22 16384  0.02        Rejection
23 32768  0.05        Rejection
24 65536  0.08        Rejection
25  2048  1.43 Rejection simple
26  4096  2.87 Rejection simple
27  8192  6.17 Rejection simple
28 16384 13.68 Rejection simple
29 32768 29.74 Rejection simple
30 65536 73.32 Rejection simple
31  2048  0.00        Reservoir
32  4096  0.00        Reservoir
33  8192  0.02        Reservoir
34 16384  0.02        Reservoir
35 32768  0.02        Reservoir
36 65536  0.04        Reservoir
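To visualize these tables, a sketch that reuses the ggplot2 approach from the question (the case column is added here only for faceting):

# Sketch: plot both benchmark runs side by side
library(ggplot2)
times1$case <- "Uniform weights"
times2$case <- "One dominant weight"
ggplot(rbind(times1, times2), aes(x = n, y = user, colour = id)) +
  geom_line() + geom_point() +
  scale_x_log10() +
  facet_wrap(~ case) +
  ylab("User time (s)")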
Obviously we can reject function 1980 because it is slower than Base in both cases. Rejection simple gets into trouble too when there is a single probability of 0.999 in the second case.
So there remain Rejection, Rcpp and Reservoir. The last step is checking whether the values themselves are correct. To be sure about them, we will use sample as a benchmark (and also to eliminate confusion about the probabilities, which do not have to coincide with p because of sampling without replacement).
p <- c(995/1000, rep(1/1000, 5))
n <- 100000

system.time(print(table(replicate(sample(1:6, 3, repl = FALSE, prob = p), n = n))/n))

      1       2       3       4       5       6
1.00000 0.39992 0.39886 0.40088 0.39711 0.40323   # Benchmark
   user  system elapsed
   1.90    0.00    2.03

system.time(print(table(replicate(sample.int.rej(2*3, 3, p), n = n))/n))

      1       2       3       4       5       6
1.00000 0.40007 0.40099 0.39962 0.40153 0.39779
   user  system elapsed
  76.02    0.03   77.49                           # Slow

system.time(print(table(replicate(weighted_Random_Sample(1:6, p, 3), n = n))/n))

      1       2       3       4       5       6
1.00000 0.49535 0.41484 0.36432 0.36338 0.36211   # Incorrect
   user  system elapsed
   3.64    0.01    3.67

system.time(print(table(replicate(fun(1:6, 3, p), n = n))/n))

      1       2       3       4       5       6
1.00000 0.39876 0.40031 0.40219 0.40039 0.39835
   user  system elapsed
   4.41    0.02    4.47
Notice a few things here. For some reason weighted_Random_Sample returns incorrect values (I have not looked into it at all, but it works correctly assuming a uniform distribution). sample.int.rej is very slow for repeated sampling.
In conclusion, it seems that Rcpp is the optimal choice for repeated sampling, while sample.int.rej is a bit faster otherwise and also easier to use.