Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Faster weighted sampling without replacement

This question led to a new R package: wrswoR

R's default sampling without replacement using sample.int seems to require quadratic run time, e.g. when using weights drawn from a uniform distribution. This is slow for large sample sizes. Does anybody know a faster implementation that would be usable from within R? Two options are "rejection sampling with replacement" (see this question on stats.sx) and the algorithm by Wong and Easton (1980) (with a Python implementation in a StackOverflow answer).

Thanks to Ben Bolker for hinting at the C function that is called internally when sample.int is called with replace=F and non-uniform weights: ProbSampleNoReplace. Indeed, the code shows two nested for loops (line 420 ff of random.c).

Here's the code to analyze the run time empirically:

library(plyr)  sample.int.test <- function(n, p) {     sample.int(2 * n, n, replace=F, prob=p); NULL }  times <- ldply(   1:7,   function(i) {     n <- 1024 * (2 ** i)     p <- runif(2 * n)     data.frame(       n=n,       user=system.time(sample.int.test(n, p), gcFirst=T)['user.self'])   },   .progress='text' )  times  library(ggplot2) ggplot(times, aes(x=n, y=user/n)) + geom_point() + scale_x_log10() +   ylab('Time per unit (s)')  # Output:        n   user 1   2048  0.008 2   4096  0.028 3   8192  0.100 4  16384  0.408 5  32768  1.645 6  65536  6.604 7 131072 26.558 

Plot

EDIT: Thanks to Arun for pointing out that unweighted sampling doesn't seem to have this performance penalty.

like image 209
krlmlr Avatar asked Feb 27 '13 13:02

krlmlr


People also ask

Which is better sampling with or without replacement?

The precision of estimates is usually higher for sampling without replacement comparing to sampling with replacement. For example, it is possible to select only one element n times when sampling is done with replacement in an extreme case.

How do you solve sampling without replacement?

In sampling without replacement, the formula for the standard deviation of all sample means for samples of size n must be modified by including a finite population correction. The formula becomes: where N is the population size, N=6 in this example, and n is the sample size, n=4 in this case.

Is sampling done without replacement?

Sampling is called without replacement when a unit is selected at random from the population and it is not returned to the main lot. The first unit is selected out of a population of size N and the second unit is selected out of the remaining population of N–1 units, and so on.

What is sampling without replacement example?

In sampling without replacement, each sample unit of the population has only one chance to be selected in the sample. For example, if one draws a simple random sample such that no unit occurs more than one time in the sample, the sample is drawn without replacement.


1 Answers

Update:

An Rcpp implementation of Efraimidis & Spirakis algorithm (thanks to @Hemmo, @Dinrem, @krlmlr and @rtlgrmpf):

library(inline) library(Rcpp) src <-  ' int num = as<int>(size), x = as<int>(n); Rcpp::NumericVector vx = Rcpp::clone<Rcpp::NumericVector>(x); Rcpp::NumericVector pr = Rcpp::clone<Rcpp::NumericVector>(prob); Rcpp::NumericVector rnd = rexp(x) / pr; for(int i= 0; i<vx.size(); ++i) vx[i] = i; std::partial_sort(vx.begin(), vx.begin() + num, vx.end(), Comp(rnd)); vx = vx[seq(0, num - 1)] + 1; return vx; ' incl <-  ' struct Comp{   Comp(const Rcpp::NumericVector& v ) : _v(v) {}   bool operator ()(int a, int b) { return _v[a] < _v[b]; }   const Rcpp::NumericVector& _v; }; ' funFast <- cxxfunction(signature(n = "Numeric", size = "integer", prob = "numeric"),                        src, plugin = "Rcpp", include = incl)  # See the bottom of the answer for comparison p <- c(995/1000, rep(1/1000, 5)) n <- 100000 system.time(print(table(replicate(funFast(6, 3, p), n = n)) / n))        1       2       3       4       5       6  1.00000 0.39996 0.39969 0.39973 0.40180 0.39882     user  system elapsed     3.93    0.00    3.96  # In case of: # Rcpp::IntegerVector vx = Rcpp::clone<Rcpp::IntegerVector>(x); # i.e. instead of NumericVector       1       2       3       4       5       6  1.00000 0.40150 0.39888 0.39925 0.40057 0.39980     user  system elapsed     1.93    0.00    2.03  

Old version:

Let us try a few possible approaches:

Simple rejection sampling with replacement. This a far more simple function than sample.int.rej offered by @krlmlr, i.e. sample size is always equal to n. As we will see, it is still really fast assuming uniform distribution for weights, but extremely slow in another situation.

fastSampleReject <- function(all, n, w){   out <- numeric(0)   while(length(out) < n)     out <- unique(c(out, sample(all, n, replace = TRUE, prob = w)))   out[1:n] } 

The algorithm by Wong and Easton (1980). Here is an implementation of this Python version. It is stable and I might be missing something, but it is much slower compared to other functions.

fastSample1980 <- function(all, n, w){   tws <- w   for(i in (length(tws) - 1):0)     tws[1 + i] <- sum(tws[1 + i], tws[1 + 2 * i + 1],                        tws[1 + 2 * i + 2], na.rm = TRUE)         out <- numeric(n)   for(i in 1:n){     gas <- tws[1] * runif(1)     k <- 0             while(gas > w[1 + k]){       gas <- gas - w[1 + k]       k <- 2 * k + 1       if(gas > tws[1 + k]){         gas <- gas - tws[1 + k]         k <- k + 1       }     }     wgh <- w[1 + k]     out[i] <- all[1 + k]             w[1 + k] <- 0     while(1 + k >= 1){       tws[1 + k] <- tws[1 + k] - wgh       k <- floor((k - 1) / 2)     }   }   out } 

Rcpp implementation of the algorithm by Wong and Easton. Possibly it can be optimized even more since this is my first usable Rcpp function, but anyway it works well.

library(inline) library(Rcpp)  src <- ' Rcpp::NumericVector weights = Rcpp::clone<Rcpp::NumericVector>(w); Rcpp::NumericVector tws = Rcpp::clone<Rcpp::NumericVector>(w); Rcpp::NumericVector x = Rcpp::NumericVector(all); int k, num = as<int>(n); Rcpp::NumericVector out(num); double gas, wgh;  if((weights.size() - 1) % 2 == 0){   tws[((weights.size()-1)/2)] += tws[weights.size()-1] + tws[weights.size()-2]; } else {   tws[floor((weights.size() - 1)/2)] += tws[weights.size() - 1]; }  for (int i = (floor((weights.size() - 1)/2) - 1); i >= 0; i--){   tws[i] += (tws[2 * i + 1]) + (tws[2 * i + 2]); } for(int i = 0; i < num; i++){   gas = as<double>(runif(1)) * tws[0];   k = 0;   while(gas > weights[k]){     gas -= weights[k];     k = 2 * k + 1;     if(gas > tws[k]){       gas -= tws[k];       k += 1;     }   }   wgh = weights[k];   out[i] = x[k];   weights[k] = 0;   while(k > 0){     tws[k] -= wgh;     k = floor((k - 1) / 2);   }   tws[0] -= wgh; } return out; '  fun <- cxxfunction(signature(all = "numeric", n = "integer", w = "numeric"),                    src, plugin = "Rcpp") 

Now some results:

times1 <- ldply(   1:6,   function(i) {     n <- 1024 * (2 ** i)     p <- runif(2 * n) # Uniform distribution     p <- p/sum(p)     data.frame(       n=n,       user=c(system.time(sample.int.test(n, p), gcFirst=T)['user.self'],              system.time(weighted_Random_Sample(1:(2*n), p, n), gcFirst=T)['user.self'],              system.time(fun(1:(2*n), n, p), gcFirst=T)['user.self'],              system.time(sample.int.rej(2*n, n, p), gcFirst=T)['user.self'],              system.time(fastSampleReject(1:(2*n), n, p), gcFirst=T)['user.self'],              system.time(fastSample1980(1:(2*n), n, p), gcFirst=T)['user.self']),       id=c("Base", "Reservoir", "Rcpp", "Rejection", "Rejection simple", "1980"))   },   .progress='text' )   times2 <- ldply(   1:6,   function(i) {     n <- 1024 * (2 ** i)     p <- runif(2 * n - 1)     p <- p/sum(p)      p <- c(0.999, 0.001 * p) # Special case     data.frame(       n=n,       user=c(system.time(sample.int.test(n, p), gcFirst=T)['user.self'],              system.time(weighted_Random_Sample(1:(2*n), p, n), gcFirst=T)['user.self'],              system.time(fun(1:(2*n), n, p), gcFirst=T)['user.self'],              system.time(sample.int.rej(2*n, n, p), gcFirst=T)['user.self'],              system.time(fastSampleReject(1:(2*n), n, p), gcFirst=T)['user.self'],              system.time(fastSample1980(1:(2*n), n, p), gcFirst=T)['user.self']),       id=c("Base", "Reservoir", "Rcpp", "Rejection", "Rejection simple", "1980"))   },   .progress='text' ) 

enter image description here

enter image description here

arrange(times1, id)        n  user               id 1   2048  0.53             1980 2   4096  0.94             1980 3   8192  2.00             1980 4  16384  4.32             1980 5  32768  9.10             1980 6  65536 21.32             1980 7   2048  0.02             Base 8   4096  0.05             Base 9   8192  0.18             Base 10 16384  0.75             Base 11 32768  2.99             Base 12 65536 12.23             Base 13  2048  0.00             Rcpp 14  4096  0.01             Rcpp 15  8192  0.03             Rcpp 16 16384  0.07             Rcpp 17 32768  0.14             Rcpp 18 65536  0.31             Rcpp 19  2048  0.00        Rejection 20  4096  0.00        Rejection 21  8192  0.00        Rejection 22 16384  0.02        Rejection 23 32768  0.02        Rejection 24 65536  0.03        Rejection 25  2048  0.00 Rejection simple 26  4096  0.01 Rejection simple 27  8192  0.00 Rejection simple 28 16384  0.01 Rejection simple 29 32768  0.00 Rejection simple 30 65536  0.05 Rejection simple 31  2048  0.00        Reservoir 32  4096  0.00        Reservoir 33  8192  0.00        Reservoir 34 16384  0.02        Reservoir 35 32768  0.03        Reservoir 36 65536  0.05        Reservoir  arrange(times2, id)        n  user               id 1   2048  0.43             1980 2   4096  0.93             1980 3   8192  2.00             1980 4  16384  4.36             1980 5  32768  9.08             1980 6  65536 19.34             1980 7   2048  0.01             Base 8   4096  0.04             Base 9   8192  0.18             Base 10 16384  0.75             Base 11 32768  3.11             Base 12 65536 12.04             Base 13  2048  0.01             Rcpp 14  4096  0.02             Rcpp 15  8192  0.03             Rcpp 16 16384  0.08             Rcpp 17 32768  0.15             Rcpp 18 65536  0.33             Rcpp 19  2048  0.00        Rejection 20  4096  0.00        Rejection 21  8192  0.02        Rejection 22 16384  0.02        Rejection 23 32768  0.05        Rejection 24 65536  0.08        Rejection 25  2048  1.43 Rejection simple 26  4096  2.87 Rejection simple 27  8192  6.17 Rejection simple 28 16384 13.68 Rejection simple 29 32768 29.74 Rejection simple 30 65536 73.32 Rejection simple 31  2048  0.00        Reservoir 32  4096  0.00        Reservoir 33  8192  0.02        Reservoir 34 16384  0.02        Reservoir 35 32768  0.02        Reservoir 36 65536  0.04        Reservoir 

Obviously we can reject function 1980 because it is slower than Base in both cases. Rejection simple gets into trouble too when there is a single probability 0.999 in the second case.

So there remains Rejection, Rcpp, Reservoir. The last step is checking whether the values themselves are correct. To be sure about them, we will be using sample as a benchmark (also to eliminate the confusion about probabilities which do not have to coincide with p because of sampling without replacement).

p <- c(995/1000, rep(1/1000, 5)) n <- 100000  system.time(print(table(replicate(sample(1:6, 3, repl = FALSE, prob = p), n = n))/n))       1       2       3       4       5       6  1.00000 0.39992 0.39886 0.40088 0.39711 0.40323  # Benchmark    user  system elapsed     1.90    0.00    2.03   system.time(print(table(replicate(sample.int.rej(2*3, 3, p), n = n))/n))       1       2       3       4       5       6  1.00000 0.40007 0.40099 0.39962 0.40153 0.39779     user  system elapsed    76.02    0.03   77.49 # Slow  system.time(print(table(replicate(weighted_Random_Sample(1:6, p, 3), n = n))/n))       1       2       3       4       5       6  1.00000 0.49535 0.41484 0.36432 0.36338 0.36211  # Incorrect    user  system elapsed     3.64    0.01    3.67   system.time(print(table(replicate(fun(1:6, 3, p), n = n))/n))       1       2       3       4       5       6  1.00000 0.39876 0.40031 0.40219 0.40039 0.39835     user  system elapsed     4.41    0.02    4.47  

Notice a few things here. For some reason weighted_Random_Sample returns incorrect values (I have not looked into it at all, but it works correct assuming uniform distribution). sample.int.rej is very slow in repeated sampling.

In conclusion, it seems that Rcpp is the optimal choice in case of repeated sampling while sample.int.rej is a bit faster otherwise and also easier to use.

like image 124
Julius Vainora Avatar answered Oct 11 '22 13:10

Julius Vainora