Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Lazily sample random results in python

Python question. I'm generating a large array of objects, which I only need to make a small random sample. Actually generating the objects in question takes a while, so I wonder if it would be possible to somehow skip over those objects that don't need generating and only explicitly create those objects that have been sampled.

In other words, I now have

a = createHugeArray()
s = random.sample(a,len(a)*0.001)

which is rather wasteful. I would prefer something more lazy like

a = createArrayGenerator()
s = random.sample(a,len(a)*0.001)

I don't know if this works. The documentation on random.sample isn't too clear, though it mentions xrange as being very fast - which makes me believe it might work. Converting the array creation to a generator would be a bit of work (my knowledge of generators is very rusty), so I want to know if this works in advance. :)

An alternative I can see is to make a random sample via xrange, and only generate those objects that are actually selected by index. That's not very clean though, because the indices generated are arbitrary and unnecessary, and I would need rather hacky logic to support this in my generateHugeArray method.

For bonus points: how does random.sample actually work? Especially, how does it work if it doesn't know the size of the population in advance, as with generators like xrange?

like image 326
Verhoevenv Avatar asked Nov 26 '10 16:11

Verhoevenv


1 Answers

There does not seem a way that avoids figuring out how the indices map to your permutations. If you don't know this, how would you create a random object from your array? You could either use the trick using xrange() you suggested yourself, or implement a class defining the __getitem__() and __len__() methods and pass and object of this class as population argument to random.sample().

Some further comments:

  • Converting createHugeArray() into a generator won't buy you anything -- random.sample() will just not work anymore. It needs an object supporting len().

  • So it does need to know the number of elements in the population right from the beginning.

  • The implementation features two different algorithms and chooses the one that will use less memory. For relatively small k (that is, in the case at hand) it will simply save the indices already chosen in a set and make a new random choice if it hits one of them.

Edit: A completely different approach would be to iterate over all permutations once and decide for each permutation if it should be included. If the total number of permutations is n and you would like to select k of them, you could write

selected = []
for i in xrange(n):
    perm = nextPermutation()
    if random.random() < float(k-len(selected))/(n-i):
        selected.append(perm)

This would choose exactly k permutations randomly.

like image 58
Sven Marnach Avatar answered Oct 10 '22 06:10

Sven Marnach