The stream provides the following operations:
class Stream:
def __init__(self, data):
self.data = list(data)
def read(self):
if not self.data:
return None
head, *self.data = self.data
return head
def peek(self):
return self.data[0] if self.data else None
The elements in the stream (ergo the elements of data
) are of constant size and neither of them is None
, so None
signals end of stream. The length of stream can only be learned by consuming the entire stream. And note that counting the number of elements consumes O(log n) space.
I believe there is no way to uniformly choose an item from the stream at random using O(1) space.
Can anyone (dis)prove this?
Generate a random number for each element, and remember the element with the smallest number.
That's the answer I like best, but the answer you're probably looking for is:
If the stream is N items long, then the probability of returning the Nth item is 1/N. Since this probability is different for every N, any machine that can accomplish this task must enter different states after reading streams of different lengths. Since the number of possible lengths is unbounded, the required number of possible states is unbounded, and the machine will require an unbounded amount of memory to distinguish between them.
In constant space? Sure, Reservoir Sampling, constant space, linear time
Some lightly tested code
import numpy as np
def stream(size):
for k in range(size):
yield k
def resSample(ni, s):
ret = np.empty(ni, dtype=np.int64)
k = 0
for k, v in enumerate(s):
if k < ni:
ret[k] = v
else:
idx = np.random.randint(0, k+1)
if (idx < ni):
ret[idx] = v
return ret
SIZE = 12
s = stream(SIZE)
q = resSample(1, s)
print(q)
I see there is a question wrt RNG. Suppose I have true RNG, hardware device which returns single bit at a time. We use it only in the code where get index.
if (idx < ni):
The only way condition would be triggered for one element to be select
is when ni=1
and thus idx
only could be ZERO.
Thus np.random.randint(0, k+1) with such implementation would be something like
def trng(k):
for _ in range(k+1):
if next_true_bit():
return 1 # as soon as it is not 0, we don't care
return 0 # all bits are zero, index is zero, proceed with exchange
QED, such realization is possible and therefore this sampling method shall work
UPDATE
@kyrill is probably right - I have to have a count going (log2(k) storage), so far see no way to avoid it. Even with RNG trick, I have to sample 0 with probability 1/k
and this k
is growing with the size of the stream.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With