I need to select k elements randomly in the range 0 to n-1. n can be up to 10^9, and k can range from 1 to n-1. I can do this in O(n) time just by shuffling an array containing the values 0 to n-1 and choosing the first k elements from it. But when k is small, this method is both time- and memory-inefficient. Is there any O(k) solution to this problem?

Note: the selected k numbers must be distinct.
I'm thinking of a solution. There are two approaches I can think of. Let R be the set to be returned.

Approach 1: Pick a random number in the range 0 to n-1 and add it to R if it is not already in R. Keep on doing this until |R| = k. This process takes sum(n/i) for n+1-k <= i <= n expected time and O(k) space.

Approach 2: Create an array containing the values 0 to n-1, shuffle it, and choose the first k elements from it. This process takes O(n+k) time and space.

So for a given k I can select the preferable method in O(k) time.
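Approach 1 (rejection sampling into a set) can be sketched like this; the function name is mine, not part of the question:

```python
import random

def sample_by_rejection(n, k):
    """Pick k distinct values in [0, n) by drawing repeatedly and
    rejecting duplicates. Expected time is sum(n/i) for
    n+1-k <= i <= n, which is O(k) when k is small relative to n."""
    chosen = set()
    while len(chosen) < k:
        chosen.add(random.randrange(n))  # duplicates are silently ignored
    return list(chosen)
```

Note that the loop only slows down as k approaches n, because late draws are increasingly likely to hit an already-chosen value.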
The shuffle solution can be improved, since you only have to shuffle the first k elements of the array. But that's still O(n) because the naïve shuffle implementation requires an array of size n, which needs to be initialized to the n values from 0 to n-1.
Initialize value[n] to {0..n-1}
For i from 0 to k-1:
swap(value[i], value[random_in_range(i, n)])
Result is value[0..k-1]
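That pseudocode translates directly to, for example, Python (names are illustrative):

```python
import random

def partial_fisher_yates(n, k):
    """Fisher-Yates shuffle limited to the first k positions of the
    array [0..n-1]. Only k swaps are performed, but the array itself
    still costs O(n) time and space to build."""
    value = list(range(n))
    for i in range(k):
        j = random.randrange(i, n)          # random index in [i, n)
        value[i], value[j] = value[j], value[i]
    return value[:k]
```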
To improve on that, we can use a kind of virtual array, consisting of two parts:
value: An array of the first k elements, which will be the resulting selection. This is initialized to {0..k-1}
rest: A sparse datastructure (a hashtable, for example) with capacity of k entries, containing all the remaining entries of the array which are not simply their own index. Initially empty.
Now we can define the function which swaps elements i and j from the value array (Note: i<k, as is guaranteed by the above algorithm):
# To swap elements i and j
If j < k:
# Both elements to be swapped are in the selection
tmp = value[i]; value[i] = value[j]; value[j] = tmp
Else If j in rest:
# Element j has been swapped before
tmp = value[i]; value[i] = rest[j]; rest[j] = tmp
Else:
# The value at j is still j, we now add it to the virtual array
rest[j] = value[i]; value[i] = j
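Putting the virtual array and the swap routine together, a sketch in Python (function and variable names are my own) might look like:

```python
import random

def sample_virtual_shuffle(n, k):
    """Fisher-Yates shuffle over a virtual array of 0..n-1, in O(k)
    time and space. `value` holds the first k slots of the virtual
    array; `rest` records only those remaining slots whose content
    differs from their own index."""
    value = list(range(k))
    rest = {}
    for i in range(k):
        j = random.randrange(i, n)
        if j < k:
            # Both positions live in the selection array.
            value[i], value[j] = value[j], value[i]
        elif j in rest:
            # Position j has been swapped into before.
            value[i], rest[j] = rest[j], value[i]
        else:
            # Position j still virtually holds the value j.
            rest[j] = value[i]
            value[i] = j
    return value
```

When k = n this degenerates into an ordinary Fisher-Yates shuffle (the `rest` dictionary stays empty), which is a convenient sanity check.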
That uses O(k) space and time, for any value of k≤n.
A simpler solution using O(k) memory is to just keep a hashtable of all the selected values, and generate values until the hashtable contains k values, rejecting duplicates.
For small k, the probability of a randomly-selected element being a duplicate is insignificant and the naïve hashtable is certainly the simplest solution. On the other hand, if k is a significant fraction of n, the hash table is mostly wasted space, since a simple bit vector of size n would be sufficient to record which values have been seen. And for very large k, the rejection algorithm will take too much time as the sample fills up, and the full vector needed for the shuffle is not much more space than the vector that would be used to hold the sample.
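A sketch of the bit-vector variant (the names are illustrative, and the answer does not prescribe a particular bit-vector encoding):

```python
import random

def sample_with_bitvector(n, k):
    """Rejection sampling that marks seen values in an n-bit vector.
    Suitable when k is large enough that a hash set of k entries
    would outweigh n bits, but small enough that rejections stay
    rare (roughly n/64 <= k <= n/4)."""
    seen = bytearray((n + 7) // 8)  # n bits, packed 8 per byte
    out = []
    while len(out) < k:
        x = random.randrange(n)
        byte, bit = x >> 3, 1 << (x & 7)
        if not seen[byte] & bit:    # first time we've drawn x
            seen[byte] |= bit
            out.append(x)
    return out
```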
As a result, the pragmatic solution is probably to use whichever of the three solutions uses less space and time: For values of k sufficiently large that the n-bit bitvector is smaller than a hash table with k entries but not so large that the probability of rejection is significant (say, n/64 ≤ k ≤ n/4), use the bitvector. For smaller values of k, use the simple hashtable algorithm, and for values of k close to n, use a Fisher-Yates shuffle on the full n-element vector (but limited to k steps).
Since we choose the O(n) strategies only in cases where k>cn for some constant c, the composite algorithm is still O(k) time and space because with that constraint, n is in O(k).