Unique random number sampling with Numpy

Tags: performance, python, numpy

I need to create a 10,000 x 50 array in which each row contains an ascending series of random numbers between 1 and 365, like so:

[[  4  11  14 ..., 355 360 364]
 [  2  13  15 ..., 356 361 361]
 [  4  12  18 ..., 356 361 365]
 ..., 
 [  6   9  17 ..., 356 362 364]
 [  1  10  19 ..., 352 357 360]
 [  1   9  17 ..., 356 358 364]]

The only way I've figured out to do this is by looping with a list comprehension:

sample_dates = np.array([np.sort(np.random.choice(365, 50, replace=False)) for _ in range(10000)])

which works, but is pretty slow (~0.33 seconds to run), and I'm going to be doing this thousands of times. Is there a faster way to accomplish this?

EDIT: From what I can tell, the most expensive part of this solution is the iteration and the 10k individual calls to np.random.choice, not the sorting.

asked Jul 11 '17 21:07

triphook

2 Answers

The following solution does not use sort:

l = np.array([True]*50 + [False]*315)
total = np.arange(1,366)
sample_dates = np.array([total[np.random.permutation(l)] for _ in range(10000)])

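Because total is already ascending, indexing it with a shuffled boolean mask returns the 50 selected values in sorted order, so no sort is needed. On my computer this is faster than the other suggested solutions: 0.44 seconds, versus 0.77 seconds for Nils Werner's solution and 0.81 seconds for the OP's.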
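
The same mask idea can probably be taken one step further by dropping the Python-level loop. The sketch below assumes NumPy 1.20 or later, where Generator.permuted can shuffle every row of a 2-D array independently. The function name and its parameters (sample_sorted_days, n_rows, n_pick, n_days, seed) are made up for illustration, and I have not timed this variant, so treat it as something to benchmark rather than a guaranteed speedup.

```python
import numpy as np


def sample_sorted_days(n_rows=10000, n_pick=50, n_days=365, seed=None):
    """Return an (n_rows, n_pick) array of distinct, ascending days in 1..n_days."""
    rng = np.random.default_rng(seed)
    # One boolean row per sample: n_pick True values followed by False values.
    mask = np.zeros((n_rows, n_days), dtype=bool)
    mask[:, :n_pick] = True
    # Shuffle each row independently (Generator.permuted needs NumPy >= 1.20).
    mask = rng.permuted(mask, axis=1)
    # np.nonzero walks the mask in row-major order, so the column indices
    # come out grouped by row and already ascending within each row.
    cols = np.nonzero(mask)[1]
    # Shift the 0-based column indices to day numbers 1..n_days.
    return cols.reshape(n_rows, n_pick) + 1


sample_dates = sample_sorted_days(seed=0)
print(sample_dates.shape)
```

As in the loop version, no sort is needed, because the positions of the True values are read off in ascending order.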

answered Oct 26 '22 18:10

Miriam Farber

Considering the shapes of the arrays, I thought iterating over columns might provide some improvement, since there are only 50 columns versus 10,000 rows. My idea was to generate 10k numbers (with replacement) for the first column. Then, in a loop, generate another 10k numbers and check for row-wise duplicates. If there are any, discard those and draw that many new random numbers, repeating until no duplicates remain. This is also called a hit-and-miss algorithm, if I remember correctly.

Here's the working code:

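import numpy as np

# First column: one draw per row (values are 0..364; add 1 for days 1..365)
arr = np.random.choice(365, 10000)
for i in range(49):
    # Candidate values for the next column
    arr2 = np.random.choice(365, 10000)
    comp = (arr2 == arr)
    # Redraw until no candidate repeats a value already chosen for its row
    while comp.any():
        # arr is 1-D on the first pass and 2-D (columns stacked as rows) afterwards
        duplicate = comp if i==0 else comp.any(axis=0)
        arr2[duplicate] = np.random.choice(365, duplicate.sum())
        comp = (arr2 == arr)
    arr = np.vstack([arr, arr2])
# Transpose to 10000 x 50 and sort each row in place
arr = arr.T
arr.sort(axis=1)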

This takes 93.4 ms to complete. Your attempt takes 590 ms on my computer, so this is a ~6x improvement.
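
For reuse, the same hit-and-miss idea could be wrapped in a function that writes into a preallocated array instead of calling np.vstack on every iteration. This is only a sketch: the function name and its parameters are hypothetical, it uses the newer Generator API rather than np.random.choice, it shifts the result to 1..365, and I have not benchmarked it against the numbers above.

```python
import numpy as np


def hit_and_miss_days(n_rows=10000, n_pick=50, n_days=365, seed=None):
    """Column-wise rejection sampling of distinct days, sorted within each row."""
    rng = np.random.default_rng(seed)
    out = np.empty((n_rows, n_pick), dtype=np.int64)
    # First column: any value in 0..n_days-1 is acceptable.
    out[:, 0] = rng.integers(n_days, size=n_rows)
    for j in range(1, n_pick):
        col = rng.integers(n_days, size=n_rows)
        # Rows where the new value already occurs in an earlier column.
        clash = (out[:, :j] == col[:, None]).any(axis=1)
        while clash.any():
            # Redraw only the clashing rows, then check again.
            col[clash] = rng.integers(n_days, size=clash.sum())
            clash = (out[:, :j] == col[:, None]).any(axis=1)
        out[:, j] = col
    out.sort(axis=1)
    # Shift from 0..n_days-1 to 1..n_days.
    return out + 1


days = hit_and_miss_days(seed=0)
# Every row should be strictly ascending, i.e. sorted with no repeats.
assert (np.diff(days, axis=1) > 0).all()
```

The strict-ascending check at the end is a cheap way to confirm that no row contains a duplicate, whichever variant you end up using.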

answered Oct 26 '22 16:10

ayhan
