
Unique random number sampling with Numpy

I need to create a 10,000 x 50 array in which each row contains an ascending series of random numbers between 1 and 365, like so:

[[  4  11  14 ..., 355 360 364]
 [  2  13  15 ..., 356 361 361]
 [  4  12  18 ..., 356 361 365]
 ..., 
 [  6   9  17 ..., 356 362 364]
 [  1  10  19 ..., 352 357 360]
 [  1   9  17 ..., 356 358 364]]

The only way I've figured out to do this is by way of an iterator:

sample_dates = np.array([np.sort(np.random.choice(365, 50, replace=False)) for _ in range(10000)])

which works, but is pretty slow (~0.33 seconds per run), and I'm going to be doing this thousands of times. Is there a faster way to accomplish this?

EDIT: From what I can tell, the most expensive part of this solution is the iteration with its 10k individual calls to np.random.choice, not the sorting.
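One way to check that claim is to time the draws with and without the per-row sort. This is a minimal sketch, assuming the standard-library `timeit` module; the helper names `draw_only` and `draw_and_sort` are made up for illustration, and the numbers will differ from machine to machine:

```python
import timeit

import numpy as np


def draw_only(n_rows=10_000, n_picks=50, n_days=365):
    # One np.random.choice call per row, no sorting
    return np.array([np.random.choice(n_days, n_picks, replace=False)
                     for _ in range(n_rows)])


def draw_and_sort(n_rows=10_000, n_picks=50, n_days=365):
    # The same draws, each followed by a per-row sort (as in the one-liner above)
    return np.array([np.sort(np.random.choice(n_days, n_picks, replace=False))
                     for _ in range(n_rows)])


if __name__ == "__main__":
    t_draw = timeit.timeit(draw_only, number=1)
    t_both = timeit.timeit(draw_and_sort, number=1)
    print(f"draw only:     {t_draw:.3f} s")
    print(f"draw and sort: {t_both:.3f} s")
```

If the two timings are close, the sort is not the bottleneck and the per-row calls are.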

triphook asked Jul 11 '17 21:07





2 Answers

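The following solution avoids sorting: shuffling a boolean mask that has exactly 50 True values and using it to index the ascending array total returns the 50 selected values already in order: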

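import numpy as np

# Boolean mask with exactly 50 True entries out of 365
mask = np.array([True]*50 + [False]*315)
total = np.arange(1, 366)  # days 1..365, already ascending
# Shuffle the mask for each row; boolean indexing preserves the ascending order
sample_dates = np.array([total[np.random.permutation(mask)] for _ in range(10000)])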

Hence it seems to be faster than the other suggested solutions: it takes 0.44 seconds on my computer, versus 0.77 seconds for Nils Werner's solution and 0.81 seconds for the OP's.
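The Python-level loop can also be dropped altogether. The sketch below uses a different technique from the mask permutation above: it draws one random key per (row, day) pair, keeps the positions of the 50 smallest keys in each row with `np.argpartition`, and then sorts those positions. The function name `sample_sorted_days` and its parameters are hypothetical, and no timing is claimed for it relative to the figures quoted here:

```python
import numpy as np


def sample_sorted_days(n_rows=10_000, n_picks=50, n_days=365, rng=None):
    """Return an (n_rows, n_picks) array of ascending, row-wise unique days in 1..n_days."""
    rng = np.random.default_rng() if rng is None else rng
    # One uniform random key per (row, day); ties have negligible probability
    keys = rng.random((n_rows, n_days))
    # Column indices of the n_picks smallest keys in each row (in no particular order)
    picks = np.argpartition(keys, n_picks - 1, axis=1)[:, :n_picks]
    # Sort each row, then shift from 0-based indices to days 1..n_days
    picks.sort(axis=1)
    return picks + 1


sample_dates = sample_sorted_days()
print(sample_dates.shape)  # (10000, 50)
```

The trade-off is memory: the key matrix holds n_rows x n_days floats (about 29 MB for 10,000 x 365), in exchange for a handful of whole-array NumPy calls in place of 10,000 small ones.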

Miriam Farber answered Oct 26 '22 18:10



Considering the shapes of the arrays, I thought iterating over columns instead of rows might help, since there are only 50 columns but 10,000 rows. So my idea was to generate 10k numbers with replacement as the first column. Then, in a loop, generate another 10k numbers and compare each one with the values already drawn for its row. If there are any duplicates, discard them and draw that many new random numbers, repeating until no duplicates remain. This is also called the hit-and-miss algorithm, if I remember correctly.

Here's the working code:

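import numpy as np

# First column: one draw per row (values are 0..364; add 1 for 1..365)
arr = np.random.choice(365, 10000)
for i in range(49):
    # Candidate values for the next column
    arr2 = np.random.choice(365, 10000)
    comp = (arr2 == arr)
    while comp.any():
        # Rows whose candidate collides with a value already drawn for that row
        duplicate = comp if i == 0 else comp.any(axis=0)
        arr2[duplicate] = np.random.choice(365, duplicate.sum())
        comp = (arr2 == arr)
    arr = np.vstack([arr, arr2])
# Transpose so each row is one sample, then sort each row ascending
arr = arr.T
arr.sort(axis=1)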

This takes 93.4 ms to complete. Your attempt takes 590 ms on my computer, so this is a ~6x improvement.
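Whichever approach is used, it is worth confirming that the result has the required properties, since a redraw loop like this is easy to get subtly wrong. Here is a small sketch; the helper name `check_sample` is hypothetical. Note that `np.random.choice(365, ...)` yields values in 0..364, so the default bounds assume that range; pass `low=1, high=365` after adding 1 to match the question:

```python
import numpy as np


def check_sample(arr, n_picks=50, low=0, high=364):
    """Return True if arr is 2-D with n_picks columns, in [low, high], and strictly ascending per row."""
    arr = np.asarray(arr)
    if arr.ndim != 2 or arr.shape[1] != n_picks:
        return False
    in_range = bool(((arr >= low) & (arr <= high)).all())
    # Strictly ascending rows cannot contain duplicates
    strictly_ascending = bool((np.diff(arr, axis=1) > 0).all())
    return in_range and strictly_ascending


# Example on a small sample drawn without replacement, then sorted
rows = np.array([np.sort(np.random.choice(365, 50, replace=False))
                 for _ in range(100)])
print(check_sample(rows))  # True
```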

ayhan answered Oct 26 '22 16:10
