Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

memory efficient random number iterator without replacement

I feel like this one should be easy but after numerous searches and attempts I can't figure an answer out. Basically I have a very large number of items that I want to sample in a random order without replacement. In this case they are cells in a 2D array. The solution that I would use for a smaller array doesn't translate because it requires shuffling an in memory array. If the number I had to sample was small I could also just randomly sample items and keep a list of the values I'd tried. Unfortunately I often will have to sample a very large proportion of all the cells, as many as all.

What I'd like to create is an iterator using some combination of itertools, numpy and/or random that yields the next random cell (x and y indices). Another possible solution would be to create an iterator that would yield the next random number (without replacement) between 0 and (x_count * y_count) which I could map back to a cell location. Neither of which seems easily accomplished.

Thanks for any sugestions!

Here's my current solution.

import numpy as np
import itertools as itr
import random as rdm

#works great
x_count = 10
y_count = 5

#good luck!
#x_count = 10000
#y_count = 20000

x_indices = np.arange(x_count)
y_indices = np.arange(y_count)

cell_indices = itr.product(x_indices, y_indices)
list_cell_indices = list(cell_indices)
rdm.shuffle(list_cell_indices)

for i in range(25):
    print list_cell_indices[i]

So based on the current responses and my attempt to translate perl which I know nothing about, I'm understanding that the best I can do is the following:

import numpy as np
import itertools as itr
import random as rdm

x_count = 10000
y_count = 5000

sample_count = 10000
keep_probability = 0.01


tried_cells = set()
kept_cells = set()

while len(kept_cells) < sample_count:
    x = rdm.randint(0, x_count)
    y = rdm.randint(0, y_count)

    if (x, y) in tried_cells:
        pass
    else:
        tried_cells.add((x, y))
        keep = rdm.random() < keep_probability
        if keep:
            kept_cells.add((x,y))


print "worked"

In most cases the processing time and memory used isn't that bad. Maybe I could do a check of the average cell keep_probability and sample_count and throw an error for difficult cases.

like image 273
Colin Talbert Avatar asked Jan 17 '23 14:01

Colin Talbert


1 Answers

I believe that there is simply no way to sample a sequence without replacement without using a significant amount of auxiliary storage for sample sizes close to R * C. And while there are clever ways to reduce the amount of storage for small sample sizes, if you expect to be sampling more than about a third of your dataset, you're better off just creating a separate list. random.sample is a natural choice for this purpose; and frankly I would just pass a flattened version of your 2-d numpy array straight to it. (Unless you need the indices too, in which case, randomly sampling ints and translating them into coordinates, a la hexparrot's solution, is a reasonable way to go.)

>>> a = numpy.arange(25).reshape((5, 5))
>>> random.sample(a.ravel(), 5)
[0, 13, 8, 18, 4]

If you look at the implementation of random.sample, you'll see that for smaller sample sizes, it already does roughly what the perl code above does -- tracks previously selected items in a set and discards selections that are already in the set. For larger sample sizes, it creates a copy of the input -- which is more memory efficient than a set for larger values, because sets take up more space than lists per stored item -- and does a slightly modified Fisher-Yates shuffle, stopping when it has sample_size items (i.e. it doesn't shuffle the whole list, so it's more efficient than shuffling the whole thing yourself.)

Basically, my bet is that you won't do better than random.sample, rolling your own, unless you code something up in c.

However -- I did find this, which you might find interesting: numpy.random.choice. This appears to do random sampling with or without replacement at c speed. The catch? It's new with Numpy 1.7!

like image 82
senderle Avatar answered Jan 30 '23 05:01

senderle