Python random sample generator (comfortable with huge population sizes)

As you might know, random.sample(population, sample_size) quickly returns a random sample, but what if you don't know the sample size in advance? You end up sampling the entire population, or shuffling it, which amounts to the same thing. This can be wasteful (if most sample sizes turn out to be small compared to the population size) or even infeasible (if the population is so large it doesn't fit in memory). Also, what if your code needs to jump around before picking the next element of the sample?
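
For illustration, the kind of interface I have in mind is roughly a partial Fisher-Yates shuffle written as a generator (just a sketch; it still copies the whole population up front, so it doesn't solve the memory problem by itself):

import random

def lazy_sample(population):
    # Partial Fisher-Yates shuffle as a generator: yields elements in random
    # order one at a time, without fixing the sample size in advance.
    pool = list(population)              # still a full copy of the population
    n = len(pool)
    for i in range(n):
        j = random.randrange(i, n)       # pick a random remaining element
        pool[i], pool[j] = pool[j], pool[i]
        yield pool[i]

sampler = lazy_sample(range(10))
first_three = [next(sampler) for _ in range(3)]   # take only what is needed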

P.S. I ran into the need to optimize random sampling while working on simulated annealing for the TSP. In my code, sampling is restarted hundreds of thousands of times, and each time I don't know whether I will need to pick 1 element or 100% of the elements of the population.

asked May 03 '26 by mmj

1 Answer

First, I would split the population into blocks. The function that does the block sampling can easily be written as a generator, so it can process samples of arbitrary size.

Imagine an infinite population, a population block of 512 and a sample size of 8. This means you could gather as many samples as you need, and for further reduction sample the already-sampled space again (for 1024 blocks this means 8192 samples, from which you can sample again).

At the same time, this allows for parallel processing, which may pay off for very large samples (see the sketch at the end of this answer).

Example with an in-memory population:

import random

population = [random.randint(0, 1000) for i in range(0, 150000)]

def sample_block(population, block_size, sample_size):
    # Walk the population block by block, yielding a fresh sample from each
    # block until the remaining slice is too short to sample from.
    block_number = 0
    while True:
        try:
            block = population[block_number * block_size:(block_number + 1) * block_size]
            yield random.sample(block, sample_size)
            block_number += 1
        except ValueError:
            # random.sample raises ValueError once the slice holds fewer
            # than sample_size elements, i.e. the population is exhausted.
            break

sampler = sample_block(population, 512, 8)
samples = []

try:
    while True:
        samples.extend(next(sampler))
except StopIteration:
    pass

print(random.sample(samples, 200))

If the population were external to the script (a file, a block device), the only modification is that you would have to load the appropriate chunk into memory (a rough sketch of a file-backed variant follows the proof of concept below). Here is a proof of concept of how sampling from an infinite population could look:

import random
import time

def population():
    # Infinite stream of random values standing in for a huge population.
    while True:
        yield random.randint(0, 10000)

def reduced_population(samples):
    # Re-expose already gathered samples as a stream, so they can be
    # fed back into sample_block and sampled again.
    for sample in samples:
        yield sample

def sample_block(generator, block_size, sample_size):
    # Accumulate block_size items from the stream, yield a sample of them,
    # then start over; stops cleanly when the stream is exhausted.
    block_number = 0
    block = []
    for item in generator:
        block.append(item)
        if len(block) == block_size:
            s = random.sample(block, sample_size)
            block_number += 1
            block = []
            print('Sampled block {} with result {}.'.format(block_number, s))
            yield s

samples = []
result = []
reducer = sample_block(population(), 512, 12)

try:
    while True:
        samples.append(next(reducer))
        if len(samples) == 1000:
            # Every 1000 block samples, sample the gathered samples themselves.
            sampler = sample_block(reduced_population(samples), 1000, 15)
            result.append(list(sampler))
            time.sleep(5)
except StopIteration:
    pass
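
For the file-backed case mentioned above, the only real change is where the items come from. As a rough sketch (assuming, purely for illustration, a text file with one value per line and a hypothetical name population.txt), the population generator could stream from disk instead of holding everything in memory:

def file_population(path):
    # Stream one value per line from a file, so the whole population
    # never has to sit in memory at once.
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:
                yield int(line)

# Plugs into the same block sampler as above:
# reducer = sample_block(file_population('population.txt'), 512, 12)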

Ideally, you'd also gather the samples and again sample them.
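
Regarding the parallel processing mentioned earlier, a minimal sketch of how the per-block sampling could be spread over worker processes might look like the following (this assumes an in-memory population and uses the standard library's concurrent.futures; block and sample sizes are just placeholders):

import random
from concurrent.futures import ProcessPoolExecutor

def sample_one_block(args):
    # Helper run in a worker process: sample sample_size items from one block.
    block, sample_size = args
    return random.sample(block, sample_size)

def parallel_sample(population, block_size, sample_size, workers=4):
    # Cut the population into full-size blocks and sample each block
    # in a separate process.
    blocks = [population[i:i + block_size]
              for i in range(0, len(population) - block_size + 1, block_size)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        results = pool.map(sample_one_block,
                           [(block, sample_size) for block in blocks])
    return [item for block_sample in results for item in block_sample]

if __name__ == '__main__':
    population = [random.randint(0, 1000) for i in range(0, 150000)]
    samples = parallel_sample(population, 512, 8)
    print(random.sample(samples, 200))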

answered May 05 '26 by mpolednik