Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Generating random sublist from ordered list that maintains ordering

Consider a problem where a random sublist of k items, Y, must be selected from X, a list of n items, where the items in Y must appear in the same order as they do in X. The selected items in Y need not be distinct. One solution is this:

for i = 1 to k
    A[i] = floor(rand * n) + 1
    Y[i] = X[A[i]]
sort Y according to the ordering of A

However, this has running time O(k log k) due to the sort operation. To remove this it's tempting to

high_index = n
for i = 1 to k
    index = floor(rand * high_index) + 1
    Y[k - i + 1] = X[index]
    high_index = index

But this gives a clear bias to the returned list due to the uniform index selection. It feels like a O(k) solution is attainable if the indices in the second solution were distributed non-uniformly. Does anyone know if this is the case, and if so what properties the distribution the marginal indices are drawn from has?

like image 930
timxyz Avatar asked Apr 11 '12 11:04

timxyz


1 Answers

Unbiased O(n+k) solution is trivial, high-level pseudo code.

  • create an empty histogram of size n [initialized with all elements as zeros]
  • populate it with k uniformly distributed variables at range. (do k times histogram[inclusiveRand(1,n)]++)
  • iterate the initial list [A], while decreasing elements in the histogram and appending elements to the result list.

Explanation [edit]:

  • The idea is to chose k elements out of n at random, with uniform distribution for each, and create a histogram out of it.
  • This histogram now contains for each index i, how many times A[i] will appear in the resulting Y list.
  • Now, iterate the list A in-order, and for each element i, insert A[i] into the resulting Y list histogram[i] times.
  • This guarantees you maintain the order because you insert elements in order, and "never go back".
  • It also guarantees unbiased solution since for each i,j,K: P(histogram[i]=K) = P(histogram[j]=K), so for each K, each element has the same probability to appear in the resulting list K times.

I believe it can be done in O(k) using the order statistics [X(i)] but I cannot figure it out though :\

like image 187
amit Avatar answered Nov 15 '22 12:11

amit