I need a way to sample without replacement from a certain array a. I tried two approaches (see the MCVE below), using random.sample() and np.random.choice.
I assumed the numpy function would be faster, but it turns out it is not. In my tests random.sample is ~15% faster than np.random.choice.
Is this correct, or am I doing something wrong in my example below? If it is correct, why?
import numpy as np
import random
import time
from contextlib import contextmanager

@contextmanager
def timeblock(label):
    # time.perf_counter() is a high-resolution timer suitable for benchmarks
    start = time.perf_counter()
    try:
        yield
    finally:
        end = time.perf_counter()
        print('{} elapsed: {}'.format(label, end - start))

def f1(a, n_sample):
    return random.sample(range(len(a)), n_sample)

def f2(a, n_sample):
    return np.random.choice(len(a), n_sample, replace=False)

# Generate a random array
a = np.random.uniform(1., 100., 10000)
# Number of sample indexes to randomly take from a
n_sample = 100
# Number of times to repeat functions f1 and f2
N = 100000

with timeblock("random.sample"):
    for _ in range(N):
        f1(a, n_sample)

with timeblock("np.random.choice"):
    for _ in range(N):
        f2(a, n_sample)
TL;DR Since numpy v1.17.0 it is recommended to use a numpy.random.default_rng() object instead of numpy.random. For choice:

import numpy as np

rng = np.random.default_rng()  # you can pass a seed here
rng.choice(...)                # the interface is the same

Besides the other changes to the random API introduced in v1.17, this new version of choice is much smarter and should be the fastest in most cases. The old version remains unchanged for backward compatibility!
As mentioned in the comments, there was a long-standing issue in numpy regarding the np.random.choice implementation being inefficient for k << n compared to random.sample from the Python standard library.
The problem was that np.random.choice(arr, size=k, replace=False) was implemented as permutation(arr)[:k]. With a large array and a small k, computing the whole array permutation is a waste of time and memory. Python's standard random.sample works in a more straightforward way: it iteratively samples without replacement, either keeping track of what has already been sampled or of what remains to sample.
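To make the difference concrete, here is a minimal sketch of the two strategies. The function names are hypothetical and this is not the actual library code; the second function shows only the selection-set variant of the random.sample approach:

import numpy as np
import random

def choice_via_permutation(n, k):
    # Old np.random.choice strategy: permute all n indices, keep the first k.
    # Costs O(n) time and memory even when k is tiny.
    return np.random.permutation(n)[:k]

def sample_iteratively(n, k):
    # random.sample-style strategy: draw until k distinct indices are collected.
    # Roughly O(k) draws when k << n.
    seen = set()
    while len(seen) < k:
        seen.add(random.randrange(n))
    return list(seen)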
In v1.17.0 numpy introduced a rework and improvement of the numpy.random package (docs, what's new, performance). I highly recommend taking a look at the first link at least. Note that, as stated there, for backward compatibility the old numpy.random API remains the same; it keeps using the old implementations.
So the new recommended way to use the random API is through a numpy.random.default_rng() object instead of numpy.random. Note that it is an object, and it accepts an optional seed argument, so you can pass it around conveniently. By default it also uses a different generator that is faster on average (see the performance link above for details).
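For example, a seeded generator can be created once and handed to any code that needs randomness. This is a minimal sketch; pick_indices is just an illustrative helper, not a library function:

import numpy as np

def pick_indices(rng, n, k):
    # Draw k distinct indices from range(n) using whatever generator is given.
    return rng.choice(n, size=k, replace=False)

rng = np.random.default_rng(42)  # 42 is an arbitrary seed for reproducibility
print(pick_indices(rng, 10000, 5))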
Regarding your case, you may want to use np.random.default_rng().choice(...) now. Besides being faster thanks to the improved random generator, choice itself became smarter: it now falls back to a whole-array permutation only when both the array is sufficiently large (>10000 elements) and k is relatively large (>1/50 of the size). Otherwise it uses Floyd's sampling algorithm (short description, numpy implementation).
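For reference, here is a minimal Python sketch of Floyd's algorithm; numpy's actual implementation is in C and differs in detail. The key property is that it makes exactly k random draws regardless of n:

import random

def floyd_sample(n, k):
    # Sample k distinct indices from range(n) with exactly k random draws.
    selected = set()
    for j in range(n - k, n):
        t = random.randint(0, j)  # uniform over [0, j], inclusive
        # If t was already chosen, j itself cannot have been chosen yet,
        # so adding j keeps the sample uniform.
        selected.add(j if t in selected else t)
    return selected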
Here's the performance comparison on my laptop:
100 samples from array of 10000 elements x 10000 times:
random.sample elapsed: 0.8711776689742692
np.random.choice elapsed: 1.9704092079773545
np.random.default_rng().choice elapsed: 0.818919860990718
1000 samples from array of 10000 elements x 10000 times:
random.sample elapsed: 8.785315042012371
np.random.choice elapsed: 1.9777243090211414
np.random.default_rng().choice elapsed: 1.05490942299366
10000 samples from array of 10000 elements x 10000 times:
random.sample elapsed: 80.15063399000792
np.random.choice elapsed: 2.0218082449864596
np.random.default_rng().choice elapsed: 2.8596064270241186
And the code I used:
import numpy as np
import random
from timeit import default_timer as timer
from contextlib import contextmanager
@contextmanager
def timeblock(label):
start = timer()
try:
yield
finally:
end = timer()
print ('{} elapsed: {}'.format(label, end - start))
def f1(a, n_sample):
return random.sample(range(len(a)), n_sample)
def f2(a, n_sample):
return np.random.choice(len(a), n_sample, replace=False)
def f3(a, n_sample):
return np.random.default_rng().choice(len(a), n_sample, replace=False)
# Generate random array
a = np.random.uniform(1., 100., 10000)
# Number of samples' indexes to randomly take from a
n_sample = 100
# Number of times to repeat tested functions
N = 100000
print(f'{N} times {n_sample} samples')
with timeblock("random.sample"):
for _ in range(N):
f1(a, n_sample)
with timeblock("np.random.choice"):
for _ in range(N):
f2(a, n_sample)
with timeblock("np.random.default_rng().choice"):
for _ in range(N):
f3(a, n_sample)