Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pick N distinct items at random from sequence of unknown length, in only one iteration

I am trying to write an algorithm that would pick N distinct items from an sequence at random, without knowing the size of the sequence in advance, and where it is expensive to iterate over the sequence more than once. For example, the elements of the sequence might be the lines of a huge file.

I have found a solution when N=1 (that is, "pick exactly one element at random from a huge sequence"):

import random items = range(1, 10) # Imagine this is a huge sequence of unknown length count = 1 selected = None for item in items:     if random.random() * count < 1:         selected = item     count += 1 

But how can I achieve the same thing for other values of N (say, N=3)?

like image 284
akonsu Avatar asked Mar 13 '12 18:03

akonsu


1 Answers

If your sequence is short enough that reading it into memory and randomly sorting it is acceptable, then a straightforward approach would be to just use random.shuffle:

import random arr=[1,2,3,4]  # In-place shuffle random.shuffle(arr)  # Take the first 2 elements of the now randomized array print arr[0:2] [1, 3] 

Depending upon the type of your sequence, you may need to convert it to a list by calling list(your_sequence) on it, but this will work regardless of the types of the objects in your sequence.

Naturally, if you can't fit your sequence into memory or the memory or CPU requirements of this approach are too high for you, you will need to use a different solution.

like image 167
Carl Bellingan Avatar answered Sep 29 '22 08:09

Carl Bellingan