Consider a problem where a random sublist of k items, Y, must be selected from X, a list of n items, where the items in Y must appear in the same order as they do in X. The selected items in Y need not be distinct. One solution is this:
for i = 1 to k
A[i] = floor(rand * n) + 1
Y[i] = X[A[i]]
sort Y according to the ordering of A
However, this has running time O(k log k) due to the sort operation. To remove this it's tempting to
high_index = n
for i = 1 to k
index = floor(rand * high_index) + 1
Y[k - i + 1] = X[index]
high_index = index
But this gives a clear bias to the returned list due to the uniform index selection. It feels like a O(k) solution is attainable if the indices in the second solution were distributed non-uniformly. Does anyone know if this is the case, and if so what properties the distribution the marginal indices are drawn from has?
Unbiased O(n+k)
solution is trivial, high-level pseudo code.
histogram[inclusiveRand(1,n)]++
)Explanation [edit]:
k
elements out of n
at random, with uniform
distribution for each, and create a histogram out of it.i
, how many times A[i]
will appear in the resulting Y
list.A
in-order, and for each element i
, insert A[i]
into the resulting Y
list histogram[i]
times.P(histogram[i]=K) = P(histogram[j]=K)
, so for each K
, each element has the same probability to appear in the resulting list K
times.I believe it can be done in O(k)
using the order statistics [X(i)] but I cannot figure it out though :\
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With