Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Select random item from stream in O(1) space

Select an item from a stream at random with uniform probability, using constant space

The stream provides the following operations:

class Stream:

  def __init__(self, data):
    self.data = list(data)

  def read(self):
    if not self.data:
      return None

    head, *self.data = self.data
    return head

  def peek(self):
    return self.data[0] if self.data else None

The elements in the stream (ergo the elements of data) are of constant size and neither of them is None, so None signals end of stream. The length of stream can only be learned by consuming the entire stream. And note that counting the number of elements consumes O(log n) space.

I believe there is no way to uniformly choose an item from the stream at random using O(1) space.

Can anyone (dis)prove this?

like image 873
Борат Сагдиев Avatar asked Jan 26 '23 11:01

Борат Сагдиев


2 Answers

Generate a random number for each element, and remember the element with the smallest number.

That's the answer I like best, but the answer you're probably looking for is:

If the stream is N items long, then the probability of returning the Nth item is 1/N. Since this probability is different for every N, any machine that can accomplish this task must enter different states after reading streams of different lengths. Since the number of possible lengths is unbounded, the required number of possible states is unbounded, and the machine will require an unbounded amount of memory to distinguish between them.

like image 69
Matt Timmermans Avatar answered Jan 30 '23 04:01

Matt Timmermans


In constant space? Sure, Reservoir Sampling, constant space, linear time

Some lightly tested code

import numpy as np

def stream(size):
    for k in range(size):
        yield k

def resSample(ni, s):
    ret = np.empty(ni, dtype=np.int64)

    k = 0
    for k, v in enumerate(s):
        if k < ni:
            ret[k] = v
        else:
            idx = np.random.randint(0, k+1)
            if (idx < ni):
                ret[idx] = v

    return ret

SIZE = 12

s = stream(SIZE)
q = resSample(1, s)
print(q)

I see there is a question wrt RNG. Suppose I have true RNG, hardware device which returns single bit at a time. We use it only in the code where get index.

if (idx < ni):

The only way condition would be triggered for one element to be select is when ni=1 and thus idx only could be ZERO.

Thus np.random.randint(0, k+1) with such implementation would be something like

def trng(k):
    for _ in range(k+1):
        if next_true_bit():
            return 1 # as soon as it is not 0, we don't care
    return 0 # all bits are zero, index is zero, proceed with exchange

QED, such realization is possible and therefore this sampling method shall work

UPDATE

@kyrill is probably right - I have to have a count going (log2(k) storage), so far see no way to avoid it. Even with RNG trick, I have to sample 0 with probability 1/k and this k is growing with the size of the stream.

like image 39
Severin Pappadeux Avatar answered Jan 30 '23 04:01

Severin Pappadeux