Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Speed up random weighted choice without replacement in python

I want to sample ~10⁷ times from a population of ~10⁷ integers without replacements and with weights, each time picking 10 elements. After each sampling I change the weights. I have timed two approaches (python3 and numpy) in the following script. Both approaches seem painfully slow to me, do you see a way of speeding it up?

import numpy as np
import random

@profile
def test_choices():
    population = list(range(10**7))
    weights = np.random.uniform(size=10**7)
    np_weights = np.array(weights)

    def numpy_choice():
        np_w = np_weights / sum(np_weights)
        c = np.random.choice(population, size=10, replace=False, p=np_w)

    def python_choice():
        c = []
        while len(c) < 10:
            c += random.choices(population=population, weights=weights, k=10 - len(c))
            c = list(set(c))

    for i in range(10**1):

        numpy_choice()
        python_choice()

        add_weight = np.random.uniform()
        random_element = random.randint(0, 10**7)
        weights[random_element] += add_weight
        np_weights[random_element] += add_weight


test_choices()

With the timer result:

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    24        10   20720062.0 2072006.2     56.6          numpy_choice()
    25        10   15593925.0 1559392.5     42.6          python_choice()
like image 968
Nik Avatar asked Sep 30 '20 09:09

Nik


People also ask

How do you do random sampling without replacement in Python?

Python's random module provides a sample() function for random sampling, randomly picking more than one element from the list without repeating elements. It returns a list of unique items chosen randomly from the list, sequence, or set. We call it random sampling without replacement.

What does replace do in NP random choice?

It controls whether the sample is returned to the sample pool. If you want only unique samples then this should be false. Show activity on this post. You can use it when you want sample some elements from a list, and meanwhile you want the elements no repeat, then you can set the "replace=False".

What is NP random choice in Python?

NumPy random choice helps you create random samples. One common task in data analysis, statistics, and related fields is taking random samples of data. You'll see random samples in probability, Bayesian statistics, machine learning, and other subjects. Random samples are very common in data-related fields.

How to do weighted random choice with replacement in Python?

Since Python 3.6, you can do weighted random choice (with replacement) using random.choices. Show activity on this post. This function takes two arguments: A list of weights and a list containing the objects to choose from:

What is the best Python tool for weighted samples without replacement?

numpy is likely the best option. But here's another pure Python solution for weighted samples without replacement. There are a couple ways to define the purpose of the parameters for population and weights. population can be defined to represent the total population of items, and weights a list of biases that influence selection.

Is it better to use NumPy or Python for random choice?

In Python, numpy has random.choice method which allows doing this: I’m always wary of using numpy without thinking because I know it incurs some overhead. This overhead is usually meaningful when small amounts of data are involved. In such a case, a pure Python implementation may be faster.

How to get random samples of array in Python?

Using numpy.random.choice () method If you are using Python older than 3.6 version, than you have to use NumPy library to achieve weighted random numbers. With the help of choice () method, we can get the random samples of one dimensional array and return the random samples of numpy array. Syntax: numpy.random.choice (list,k, p=None)


Video Answer


2 Answers

This is just a comment on jdhesa's answer. The question was if it is useful to consider the case where only one weight is incresed -> Yes it is!

Example

@nb.njit(parallel=True)
def numba_choice_opt(population, weights, k,wc,b_full_wc_calc,ind,value):
    # Get cumulative weights
    if b_full_wc_calc:
        acc=0
        for i in range(weights.shape[0]):
            acc+=weights[i]
            wc[i]=acc
    #Increase only one weight (faster than recalculating the cumulative  weight)
    else:
        weights[ind]+=value
        for i in nb.prange(ind,wc.shape[0]):
            wc[i]+=value

    # Total of weights
    m = wc[-1]
    # Arrays of sample and sampled indices
    sample = np.empty(k, population.dtype)
    sample_idx = np.full(k, -1, np.int32)
    # Sampling loop
    i = 0
    while i < k:
        # Pick random weight value
        r = m * np.random.rand()
        # Get corresponding index
        idx = np.searchsorted(wc, r, side='right')
        # Check index was not selected before
        # If not using Numba you can just do `np.isin(idx, sample_idx)`
        for j in range(i):
            if sample_idx[j] == idx:
                continue
        # Save sampled value and index
        sample[i] = population[idx]
        sample_idx[i] = population[idx]
        i += 1
    return sample

Example

np.random.seed(0)
population = np.random.randint(100, size=1_000_000)
weights = np.random.rand(len(population))
k = 10
wc = np.empty_like(weights)

#Initial calculation 
%timeit numba_choice_opt(population, weights, k,wc,True,0,0)
#1.41 ms ± 9.21 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

#Increase weight[100] by 3 and calculate
%timeit numba_choice_opt(population, weights, k,wc,False,100,3)
#213 µs ± 6.06 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

#For comparison
#Please note that it is the memory allcocation of wc which makes
#it so much slower than the initial calculation from above
%timeit numba_choice(population, weights, k)
#4.23 ms ± 64.9 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
like image 138
max9111 Avatar answered Oct 23 '22 18:10

max9111


You can try something like this. I have accelerated my function with Numba but in my tests it is faster also without that.

import numpy as np
import numba as nb

@nb.njit
def numba_choice(population, weights, k):
    # Get cumulative weights
    wc = np.cumsum(weights)
    # Total of weights
    m = wc[-1]
    # Arrays of sample and sampled indices
    sample = np.empty(k, population.dtype)
    sample_idx = np.full(k, -1, np.int32)
    # Sampling loop
    i = 0
    while i < k:
        # Pick random weight value
        r = m * np.random.rand()
        # Get corresponding index
        idx = np.searchsorted(wc, r, side='right')
        # Check index was not selected before
        # If not using Numba you can just do `np.isin(idx, sample_idx)`
        for j in range(i):
            if sample_idx[j] == idx:
                continue
        # Save sampled value and index
        sample[i] = population[idx]
        sample_idx[i] = population[idx]
        i += 1
    return sample

Here is a quick comparison

def python_choice(population, weights, k):
    c = []
    while len(c) < 10:
        c += random.choices(population=population, weights=weights, k=10 - len(c))
        c = list(set(c))
    return c

def numpy_choice(population, weights, k):
    w = weights / weights.sum()
    return np.random.choice(population, size=k, replace=False, p=w)

# Test
np.random.seed(0)
population = np.random.randint(100, size=1_000_000)
weights = np.random.rand(len(population))
k = 10
print(python_choice(population, weights, k))
# [96, 99, 90, 46, 78, 16, 17, 22, 58, 30]
print(numpy_choice(population, weights, k))
# [ 9 61  1 18 41 89 55  4 53 40]
print(numba_choice(population, weights, k))
# [66 82 91 62  9 56 71 14 32 26]

%timeit python_choice(population, weights, k)
# 198 ms ± 19.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit numpy_choice(population, weights, k)
# 13.4 ms ± 65.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit numba_choice(population, weights, k)
# 2.08 ms ± 27.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

EDIT: Here is how it could go without Numba:

import numpy as np

def loop_choice(population, weights, k):
    wc = np.cumsum(weights)
    m = wc[-1]
    sample = np.empty(k, population.dtype)
    sample_idx = np.full(k, -1, np.int32)
    i = 0
    while i < k:
        r = m * np.random.rand()
        idx = np.searchsorted(wc, r, side='right')
        if np.isin(idx, sample_idx):
            continue
        sample[i] = population[idx]
        sample_idx[i] = population[idx]
        i += 1
    return sample

# Setup from before...
%timeit loop_choice(population, weights, k)
# 3.55 ms ± 23.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

EDIT: Just a small test to check the samples are adjusted to the weights:

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)
n = 200
population = np.arange(n)
weights = np.sin(np.linspace(0, 2 * np.pi, n)) + 1
k = 15
r = 1600
a = np.zeros(n, np.int32)
for _ in range(r):
    c = numba_choice(population, weights, k)
    np.add.at(a, c, 1)
plt.figure()
plt.plot(weights / weights.sum(), label='Weights')
plt.plot(a / (k * r), label='Samples')
plt.legend()
plt.tight_layout()
plt.show()

Result:

Sample result

like image 3
jdehesa Avatar answered Oct 23 '22 17:10

jdehesa