I feel like this one should be easy, but after numerous searches and attempts I can't figure out an answer. Basically, I have a very large number of items that I want to sample in a random order without replacement; in this case, they are cells in a 2D array. The solution I would use for a smaller array doesn't translate, because it requires shuffling an in-memory array. If the number of samples were small, I could also just pick cells at random and keep a list of the ones I'd already tried. Unfortunately, I often have to sample a very large proportion of all the cells, sometimes all of them.
What I'd like to create is an iterator, using some combination of itertools, numpy, and/or random, that yields the next random cell (its x and y indices). Another possible solution would be an iterator that yields the next random number (without replacement) between 0 and (x_count * y_count), which I could map back to a cell location. Neither seems easily accomplished.
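(For what it's worth, the mapping from a flat index back to a cell is the easy part; it's the without-replacement part I'm stuck on. A sketch, assuming row-major order:)

def flat_to_cell(flat_index, y_count):
    # flat_index = x * y_count + y, so divmod undoes the mapping
    x, y = divmod(flat_index, y_count)
    return x, y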
Thanks for any suggestions!
Here's my current solution.
import numpy as np
import itertools as itr
import random as rdm

# works great
x_count = 10
y_count = 5

# good luck!
# x_count = 10000
# y_count = 20000

x_indices = np.arange(x_count)
y_indices = np.arange(y_count)
cell_indices = itr.product(x_indices, y_indices)
list_cell_indices = list(cell_indices)
rdm.shuffle(list_cell_indices)
for i in range(25):
    print(list_cell_indices[i])
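One variant I've considered (my own untested sketch): shuffle a flat numpy integer array instead of a Python list of tuples. Memory is still O(x_count * y_count), but each item is a machine int rather than a tuple object:

import numpy as np

x_count = 10
y_count = 5

# Permute the flat indices, then decode each one lazily with divmod.
flat = np.random.permutation(x_count * y_count)
for flat_index in flat[:25]:
    print(divmod(int(flat_index), y_count))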
So, based on the current responses and my attempt to translate Perl (a language I know nothing about), I'm understanding that the best I can do is the following:
import random as rdm

x_count = 10000
y_count = 5000
sample_count = 10000
keep_probability = 0.01

tried_cells = set()
kept_cells = set()

while len(kept_cells) < sample_count:
    # randrange is half-open, so the indices stay inside the grid
    # (randint(0, x_count) would occasionally step one past the edge)
    x = rdm.randrange(x_count)
    y = rdm.randrange(y_count)
    if (x, y) not in tried_cells:
        tried_cells.add((x, y))
        if rdm.random() < keep_probability:
            kept_cells.add((x, y))

print("worked")
In most cases the processing time and memory used aren't that bad. Maybe I could check keep_probability and sample_count against the total cell count up front, and raise an error for the cases where the loop is unlikely to finish in reasonable time.
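Something like this up-front check, say (the function and its threshold are my own sketch):

def check_feasible(x_count, y_count, sample_count, keep_probability):
    # Even if the loop eventually tries every cell, it only keeps
    # about keep_probability * total of them, so asking for more
    # than that is likely to spin forever.
    expected_max_kept = int(x_count * y_count * keep_probability)
    if sample_count > expected_max_kept:
        raise ValueError(
            "sample_count %d exceeds the ~%d cells this "
            "keep_probability can be expected to yield"
            % (sample_count, expected_max_kept)
        )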
I believe there is simply no way to sample a sequence without replacement, without using a significant amount of auxiliary storage, for sample sizes close to R * C. And while there are clever ways to reduce the amount of storage for small sample sizes, if you expect to sample more than about a third of your dataset, you're better off just creating a separate list. random.sample is a natural choice for this purpose; frankly, I would just pass a flattened version of your 2-d numpy array straight to it. (Unless you need the indices too, in which case randomly sampling ints and translating them into coordinates, a la hexparrot's solution, is a reasonable way to go.)
>>> import numpy, random
>>> a = numpy.arange(25).reshape((5, 5))
>>> random.sample(a.ravel().tolist(), 5)
[0, 13, 8, 18, 4]
If you look at the implementation of random.sample, you'll see that for smaller sample sizes it already does roughly what the Perl code above does: it tracks previously selected items in a set and discards selections that are already in the set. For larger sample sizes, it creates a copy of the input -- which is more memory-efficient than a set for larger values, because sets take up more space than lists per stored item -- and does a slightly modified Fisher-Yates shuffle, stopping when it has sample_size items (i.e. it doesn't shuffle the whole list, so it's more efficient than shuffling the whole thing yourself).
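To make that concrete, here's a minimal sketch of the stop-early Fisher-Yates idea (my own illustration, not CPython's actual code):

import random

def partial_shuffle_sample(population, k):
    # Run only the first k steps of a Fisher-Yates shuffle; slots
    # 0..k-1 then hold a uniform sample without replacement.
    pool = list(population)
    n = len(pool)
    for i in range(k):
        j = random.randrange(i, n)
        pool[i], pool[j] = pool[j], pool[i]
    return pool[:k]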
Basically, my bet is that you won't do better than random.sample by rolling your own, unless you code something up in C.
However, I did find this, which you might find interesting: numpy.random.choice. This appears to do random sampling with or without replacement at C speed. The catch? It's new with numpy 1.7!
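For example (assuming numpy >= 1.7), you can sample distinct flat indices and decode them to cells. One caveat: at least in the legacy implementation, replace=False builds a full permutation internally, so memory is again on the order of x_count * y_count:

import numpy as np

x_count = 10000
y_count = 20000

# Draw 25 distinct flat indices at C speed, then decode to cells.
flat = np.random.choice(x_count * y_count, size=25, replace=False)
cells = [divmod(int(i), y_count) for i in flat]
print(cells[:5])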