In another thread, I saw that the time complexity of a binary-heap weighted random sample is O(n * log(m)), where n is the number of choices and m is the number of nodes to choose from.
I was wondering about the time complexity of the unweighted random sample that Python provides as random.sample. Is the time complexity simply O(n), or is it something else entirely?
Python source: random.py (line 267).
Here's the relevant bit:
315     selected = set()
316     selected_add = selected.add
317     for i in range(k):
318         j = randbelow(n)
319         while j in selected:
320             j = randbelow(n)
321         selected_add(j)
322         result[i] = population[j]
It basically "rolls the dice" for a random index into population
. If it gets an index that's already in the set selected
, it re-rolls. Rinse, lather and repeat k
times (where k
is the number of samples you asked for.)
It appears to be O(n)
in the size of the requested number of samples. There are some optimisations for small sets, but the meat of the thing is the main loop above.
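As a rough, unscientific check (my own illustration, not part of the original answer; the population size and repeat counts are arbitrary), you can see the linear growth in the number of samples requested by timing random.sample with a fixed large population and a growing sample size:

import random
import timeit

# With a fixed (large) sequence population, the set-based branch does about
# k rolls plus k set inserts, so the time should grow roughly linearly in k.
population = range(10**7)     # a range is a sequence, so there is no O(n) copy
for k in (10, 100, 1000, 10000):
    t = timeit.timeit(lambda: random.sample(population, k), number=100)
    print(f"k={k:>5}: {t:.4f} s for 100 calls")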
Edit:
I believe lines 305-313 are a special case for when the number of samples requested, k, is a large proportion of the total population n. Instead of rolling for random elements from the entire population (and re-rolling if we collide with an element we already selected), we explicitly maintain a list of the elements we have yet to select. We are guaranteed to get a new element on every roll, but the trade-off is that we have to build and maintain that list.
If I'm interpreting this wrong, feel free to comment below.
303     result = [None] * k
304     setsize = 21        # size of a small set minus size of an empty list
305     if k > 5:
306         setsize += 4 ** _ceil(_log(k * 3, 4)) # table size for big sets
307     if n <= setsize:
308         # An n-length list is smaller than a k-length set
309         pool = list(population)
310         for i in range(k):   # invariant: non-selected at [0,n-i)
311             j = randbelow(n-i)
312             result[i] = pool[j]
313             pool[j] = pool[n-i-1]   # move non-selected item into vacancy
314     else:
315         selected = set()
316         selected_add = selected.add
317         for i in range(k):
318             j = randbelow(n)
319             while j in selected:
320                 j = randbelow(n)
321             selected_add(j)
322             result[i] = population[j]
323     return result
Source : https://hg.python.org/cpython/file/ab500b297900/Lib/random.py
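For completeness, the pool-based special case described in the edit can be written as a stand-alone function. This is only a sketch of the same swap-into-the-vacancy idea (the function name and example values are mine), not a drop-in replacement for random.sample:

import random

def sample_by_pool(population, k):
    """Sketch of the pool branch: copy the population once (O(n)), then do a
    partial Fisher-Yates shuffle so every roll hits a not-yet-selected element."""
    pool = list(population)            # the O(n) copy
    n = len(pool)
    result = [None] * k
    for i in range(k):                 # invariant: non-selected at [0, n-i)
        j = random.randrange(n - i)    # roll only over the unselected part
        result[i] = pool[j]
        pool[j] = pool[n - i - 1]      # move a non-selected item into the vacancy
    return result

print(sample_by_pool(range(10), 4))    # e.g. [7, 2, 9, 0]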
This answer is similar to Li-aung Yip's above, but I think the extra precision can be important for certain applications.
With a population of size n and a sample of size k, the complexity can be affected by the type of the population: if the population is a set, line 296 of random.py copies it to a tuple, which is O(n) no matter what n and k are. If it is a sequence, no pre-treatment is necessary.
Afterwards, if k is large enough (see below), random.sample uses the following loop (lines 305-313 of random.py), which is O(n) again because of the copy of the population.
if k > 5:
    setsize += 4 ** _ceil(_log(k * 3, 4)) # table size for big sets
if n <= setsize:
    # An n-length list is smaller than a k-length set
    pool = list(population)
    for i in range(k):   # invariant: non-selected at [0,n-i)
        j = randbelow(n-i)
        result[i] = pool[j]
        pool[j] = pool[n-i-1]   # move non-selected item into vacancy
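If you want to check which regime a given (n, k) pair falls into, the threshold test from lines 303-307 can be reproduced in isolation (a sketch; the helper name is mine):

from math import ceil, log

def uses_pool_branch(n, k):
    """Mirrors the threshold test: the pool (copy + partial shuffle) branch is
    taken when an n-length list would be smaller than the k-length set that
    the other branch would have to build."""
    setsize = 21                                 # size of a small set minus size of an empty list
    if k > 5:
        setsize += 4 ** ceil(log(k * 3, 4))      # table size for big sets
    return n <= setsize

print(uses_pool_branch(1_000_000, 10))   # False -> set-based branch (shown below)
print(uses_pool_branch(100, 80))         # True  -> pool-based branch (shown above)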
The only hope for an average-case complexity different from O(n) is when k is sufficiently small that the following loop is used:
else:
    selected = set()
    selected_add = selected.add
    for i in range(k):
        j = randbelow(n)
        while j in selected:
            j = randbelow(n)
        selected_add(j)
        result[i] = population[j]
which seems to be O(k) in the average case, but I'm not sure how to prove it. Feel free to comment if I made a mistake.
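Here is one way to argue it (my own sketch, not a rigorous proof): this branch is only taken when n > setsize, and setsize always exceeds 3k (for k > 5, 4 ** _ceil(_log(3k, 4)) >= 3k; for k <= 5, setsize is already 21 > 3k), so each roll collides with an already-selected index with probability below 1/3. The expected number of rolls per selection is therefore at most 1 / (1 - 1/3) = 3/2, giving O(k) rolls in expectation. A quick simulation of the roll count (all parameters arbitrary):

import random

def average_rolls(n, k, trials=200):
    """Average number of index rolls the set-based loop needs to pick
    k distinct indices out of n."""
    total = 0
    for _ in range(trials):
        selected = set()
        for _ in range(k):
            rolls = 1
            j = random.randrange(n)
            while j in selected:
                j = random.randrange(n)
                rolls += 1
            selected.add(j)
            total += rolls
    return total / trials

# For k = 1000 the threshold works out to setsize = 4117, so n = 5000 really
# does use this branch; the average stays close to k (about 1.12 * k here),
# well under the 1.5 * k bound.
print(average_rolls(n=5000, k=1000))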