My use case is a bit specific. I want to sample 2 items without replacement from a list/array (of 50, or 100 elements). So I don't have to worry about arrays of sizes of 10^4 or 10^5 or multidimensional data.
I want to know
numpy.random.choice()
or numpy.random.shuffle()
is faster for this purpose, and why?For Question 1, I tried timing both functions (code below), and the shuffle method seems to about 5-6 times faster. Any insight you can give about this is most welcome. If there are faster ways to achieve my purpose, I will be glad to hear about them (I had looked at the options of python random
module, but the fastest method from my testing was using np.random.shuffle()
).
def shuffler(size, num_samples):
items = list(range(size))
np.random.shuffle(items)
return items[:num_samples]
def chooser(size, num_samples):
return np.random.choice(size, num_samples, replace=False)
%timeit shuffler(50, 2)
#> 1.84 µs ± 17.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit chooser(50, 2)
#> 13 µs ± 1.09 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
You may think it's already optimized and I am wasting time trying to save pennies. But np.random.choice()
is called 5000000 times in my code and takes about 8% of my runtime. It is being used in a loop to obtain 2 random samples from the population for each iteration.
Pseudocode:
for t in range(5000000):
# Random sample of 2 from the population without replacement.
If there is a smarter implementations for my requirement, I am open to suggestions.
PS: I am aware that shuffle
performs in place operation, but as I just require the indices of the two random elements I do not essentially have to perform it on my original array. There are other questions that compares the two functions from python random
module. But I require 2 samples without replacement.
To answer your questions:
shuffle
seems to be the fastest implementationLet's start @SvenMarnach's answer here. This is not a dupe of that question, but the answer is useful. Unfortunately that answer doesn't bring it in line with shuffler
timewise:
%timeit shuffler(50, 2)
2.47 µs ± 180 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit chooser(50, 2)
52.5 µs ± 3.58 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
rng = np.random.default_rng()
def chooser2(size, num_samples):
return rng.choice(size, num_samples, replace=False)
%timeit chooser2(50, 2)
15.9 µs ± 1.41 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
the random.sample
answer is better:
import random
def sampler(size, num_samples):
return np.array(random.sample(range(size), num_samples))
%timeit sampler(50, 2)
4.6 µs ± 140 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Still slower, though.
Since I'm not able to parse c
code, I'll take sven's word for it that random.choice
is doing shuffle-and-split in the background, so the methods should be equivalent. Why it's so much faster here is baffling to me though.
EDIT: A sample_indices
based on @DaniMesejo's answer (slightly slower for num_samples = 2
):
def sample_indices(pop, size, num_samples):
arr = np.random.rand(pop, size)
return np.argpartition(arr, num_samples, axis = 1)[:, :num_samples]
See the source code for numpy.random.choice
; with replace=False
it creates a 50-item temporary list, shuffles that list, and takes two items from that list.
Since version 1.17, the implementation decisions for numpy.random.choice
and numpy.random.shuffle
, as with other numpy.random
functions, can't be changed without affecting backward compatibility (see the recent RNG policy for NumPy). See also these questions:
Compare numpy.random.choice
with numpy.random.Generator.choice
, which is the newer way to sample items in NumPy 1.17 and later. The advantage is that numpy.random.Generator.choice
is not subject to the same compatibility guarantee as numpy.random.choice
or numpy.random.shuffle
. If you care about performance of numpy.random.Generator
you can file an issue in NumPy's GitHub repository.
You could use an alternative solution, the idea is to generate a random array and then find the position of the min and max:
import numpy as np
def sample_indices(ran, size):
arr = np.random.rand(ran, size)
mi = np.argmin(arr, axis=1).reshape((-1, 1))
ma = np.argmax(arr, axis=1).reshape((-1, 1))
return np.hstack((mi, ma))
def shuffler(size, num_samples):
items = list(range(size))
np.random.shuffle(items)
return items[:num_samples]
def chooser(size, num_samples):
return np.random.choice(size, num_samples, replace=False)
def sample_indices_shuffler(ran, size):
return np.array([shuffler(size, 2) for _ in range(ran)])
def sample_indices_chooser(ran, size):
return np.array([chooser(size, 2) for _ in range(ran)])
Here are the timings:
%timeit sample_indices_chooser(1000, 50)
17.3 ms ± 1.74 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit sample_indices_shuffler(1000, 50)
2.69 ms ± 215 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit sample_indices(1000, 50)
553 µs ± 22.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
For:
res = sample_indices(10, 50)
print(res)
Output
[[ 9 6]
[31 42]
[17 42]
[24 45]
[ 2 49]
[27 31]
[21 19]
[ 7 16]
[20 28]
[32 36]]
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With