Time complexity of random.sample

In another thread, I saw that the time complexity of a binary-heap weighted random sample is O(n * log(m)), where n is the number of samples taken and m is the number of nodes to choose from.

I was wondering about the time complexity of an unweighted random sample, as implemented by Python's random.sample. Is the time complexity simply O(n), or is it something else entirely?

asked May 07 '12 by chrtan

2 Answers

Python source: random.py (line 267).

Here's the relevant bit:

   315             selected = set()
   316             selected_add = selected.add
   317             for i in range(k):
   318                 j = randbelow(n)
   319                 while j in selected:
   320                     j = randbelow(n)
   321                 selected_add(j)
   322                 result[i] = population[j]

It basically "rolls the dice" for a random index into population. If it gets an index that's already in the set selected, it re-rolls. Rinse, lather and repeat k times (where k is the number of samples you asked for).

It appears to be O(k) in the number of requested samples (note that k here is what the question calls n). There are some optimisations for when the sample is a large fraction of the population (see the edit below), but the meat of the thing is the main loop above.
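As a quick sanity check, one can time random.sample for increasing k over a fixed population (a rough sketch; the population size and k values are arbitrary, and absolute timings will vary by machine):

    import random
    import timeit

    # A range object is a sequence, so no copy of the population is needed.
    population = range(10_000_000)

    # If the main loop is O(k), each tenfold increase in k should cost about 10x.
    for k in (1_000, 10_000, 100_000):
        t = timeit.timeit(lambda: random.sample(population, k), number=10)
        print(f"k={k:>7}: {t:.3f}s for 10 runs")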


Edit:

I believe lines 305-313 are a special case for when the number of samples requested, k, is a large proportion of the total population size n. Instead of rolling for random elements from the entire population (and re-rolling if we collide with an element we already selected), we explicitly maintain a pool of the elements we have yet to select. We are guaranteed to get a new element every time, but the tradeoff is that building and maintaining that pool costs an O(n) copy of the population. (A small sketch after the code below shows which branch gets taken for various n and k.)

If I'm interpreting this wrong, feel free to comment below.

   303         result = [None] * k
   304         setsize = 21        # size of a small set minus size of an empty list
   305         if k > 5:
   306             setsize += 4 ** _ceil(_log(k * 3, 4)) # table size for big sets
   307         if n <= setsize:
   308             # An n-length list is smaller than a k-length set
   309             pool = list(population)
   310             for i in range(k):         # invariant:  non-selected at [0,n-i)
   311                 j = randbelow(n-i)
   312                 result[i] = pool[j]
   313                 pool[j] = pool[n-i-1]   # move non-selected item into vacancy
   314         else:
   315             selected = set()
   316             selected_add = selected.add
   317             for i in range(k):
   318                 j = randbelow(n)
   319                 while j in selected:
   320                     j = randbelow(n)
   321                 selected_add(j)
   322                 result[i] = population[j]
   323         return result
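To see which branch gets taken for a given n and k, the threshold logic from lines 303-307 can be lifted into a standalone function (a sketch for illustration; uses_pool_branch is not part of random.py):

    from math import ceil, log

    def uses_pool_branch(n, k):
        setsize = 21                             # size of a small set minus size of an empty list
        if k > 5:
            setsize += 4 ** ceil(log(k * 3, 4))  # table size for big sets
        return n <= setsize                      # True -> pool copy, False -> set rejection

    # The pool branch only wins when k is a large fraction of n.
    for n, k in [(1_000_000, 100), (100, 5), (100, 50)]:
        print(n, k, "pool" if uses_pool_branch(n, k) else "set")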
answered Nov 20 '22 by Li-aung Yip


Source: https://hg.python.org/cpython/file/ab500b297900/Lib/random.py

This answer is similar to Li-aung Yip's above, but I think the details can be important for certain applications.

With a population of size n and a sample of size k, the complexity can be affected by the type of the population argument: if the population is a set, line 296 of random.py copies it to a tuple, which is O(n) no matter the values of n and k. If it is a sequence, no preprocessing is necessary.
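To illustrate the cost of that copy (a sketch with arbitrary sizes; note that passing a set directly to random.sample was deprecated in Python 3.9 and removed in 3.11, so the conversion is made explicit here):

    import random
    import time

    population_set = set(range(5_000_000))

    t0 = time.perf_counter()
    as_tuple = tuple(population_set)      # the O(n) pre-treatment from line 296
    t1 = time.perf_counter()
    sample = random.sample(as_tuple, 10)  # k = 10: cheap once we have a sequence
    t2 = time.perf_counter()

    print(f"copy to tuple: {t1 - t0:.3f}s, sampling 10 items: {t2 - t1:.6f}s")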

Afterwards, if k is a large enough fraction of n (see below), random.sample uses the following loop (lines 305-313 of random.py), which is O(n) again because of the copy of the population:

    if k > 5:
        setsize += 4 ** _ceil(_log(k * 3, 4)) # table size for big sets
    if n <= setsize:
        # An n-length list is smaller than a k-length set
        pool = list(population)
        for i in range(k):         # invariant:  non-selected at [0,n-i)
            j = randbelow(n-i)
            result[i] = pool[j]
            pool[j] = pool[n-i-1]   # move non-selected item into vacancy
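For reference, the same idea, a partial Fisher-Yates shuffle, as a self-contained sketch (sample_by_pool is an illustrative name, not CPython's code):

    import random

    def sample_by_pool(population, k):
        """Sample k distinct items via a partial Fisher-Yates shuffle.

        The list(population) copy makes this O(n) regardless of k.
        """
        pool = list(population)          # O(n) copy: the dominant cost
        n = len(pool)
        result = [None] * k
        for i in range(k):               # invariant: non-selected at [0, n-i)
            j = random.randrange(n - i)
            result[i] = pool[j]
            pool[j] = pool[n - i - 1]    # move a non-selected item into the vacancy
        return result

    print(sample_by_pool(range(10), 4))  # e.g. [7, 2, 9, 0]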

The only hope for an average-case complexity different from O(n) is when k is sufficiently small that the following loop is used:

    else:
        selected = set()
        selected_add = selected.add
        for i in range(k):
            j = randbelow(n)
            while j in selected:
                j = randbelow(n)
            selected_add(j)
            result[i] = population[j]

which seems to be O(k) in the average case. A rough argument: this branch is only taken when n > setsize, and setsize is always at least 3k (for k > 5 it includes the term 4 ** _ceil(_log(k * 3, 4)) >= 3k), so fewer than a third of the indices can ever be in selected; each roll then succeeds with probability greater than 2/3, the expected number of rolls per sample is under 3/2, and the total expected work is O(k). Feel free to comment if I made a mistake.
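An empirical check of that claim (a sketch; count_rolls and the sizes below are made up for illustration): counting the total number of dice rolls shows the average staying close to one roll per sample when k is small relative to n:

    import random

    def count_rolls(n, k):
        """Run the set-rejection loop above and count the total rolls."""
        selected = set()
        rolls = 0
        for _ in range(k):
            j = random.randrange(n)
            rolls += 1
            while j in selected:
                j = random.randrange(n)
                rolls += 1
            selected.add(j)
        return rolls

    # With n > 3k, expect only slightly more than k rolls in total.
    for n, k in [(1_000_000, 1_000), (100_000, 10_000)]:
        print(f"n={n}, k={k}: {count_rolls(n, k)} rolls")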

answered Nov 20 '22 by Guillaume St-Onge