I want to sample ~10⁷ times from a population of ~10⁷ integers without replacements and with weights, each time picking 10 elements. After each sampling I change the weights. I have timed two approaches (python3 and numpy) in the following script. Both approaches seem painfully slow to me, do you see a way of speeding it up?
import numpy as np
import random
@profile
def test_choices():
population = list(range(10**7))
weights = np.random.uniform(size=10**7)
np_weights = np.array(weights)
def numpy_choice():
np_w = np_weights / sum(np_weights)
c = np.random.choice(population, size=10, replace=False, p=np_w)
def python_choice():
c = []
while len(c) < 10:
c += random.choices(population=population, weights=weights, k=10 - len(c))
c = list(set(c))
for i in range(10**1):
numpy_choice()
python_choice()
add_weight = np.random.uniform()
random_element = random.randint(0, 10**7)
weights[random_element] += add_weight
np_weights[random_element] += add_weight
test_choices()
With the timer result:
Line # Hits Time Per Hit % Time Line Contents
==============================================================
24 10 20720062.0 2072006.2 56.6 numpy_choice()
25 10 15593925.0 1559392.5 42.6 python_choice()
Python's random module provides a sample() function for random sampling, randomly picking more than one element from the list without repeating elements. It returns a list of unique items chosen randomly from the list, sequence, or set. We call it random sampling without replacement.
It controls whether the sample is returned to the sample pool. If you want only unique samples then this should be false. Show activity on this post. You can use it when you want sample some elements from a list, and meanwhile you want the elements no repeat, then you can set the "replace=False".
NumPy random choice helps you create random samples. One common task in data analysis, statistics, and related fields is taking random samples of data. You'll see random samples in probability, Bayesian statistics, machine learning, and other subjects. Random samples are very common in data-related fields.
Since Python 3.6, you can do weighted random choice (with replacement) using random.choices. Show activity on this post. This function takes two arguments: A list of weights and a list containing the objects to choose from:
numpy is likely the best option. But here's another pure Python solution for weighted samples without replacement. There are a couple ways to define the purpose of the parameters for population and weights. population can be defined to represent the total population of items, and weights a list of biases that influence selection.
In Python, numpy has random.choice method which allows doing this: I’m always wary of using numpy without thinking because I know it incurs some overhead. This overhead is usually meaningful when small amounts of data are involved. In such a case, a pure Python implementation may be faster.
Using numpy.random.choice () method If you are using Python older than 3.6 version, than you have to use NumPy library to achieve weighted random numbers. With the help of choice () method, we can get the random samples of one dimensional array and return the random samples of numpy array. Syntax: numpy.random.choice (list,k, p=None)
This is just a comment on jdhesa's answer. The question was if it is useful to consider the case where only one weight is incresed -> Yes it is!
Example
@nb.njit(parallel=True)
def numba_choice_opt(population, weights, k,wc,b_full_wc_calc,ind,value):
# Get cumulative weights
if b_full_wc_calc:
acc=0
for i in range(weights.shape[0]):
acc+=weights[i]
wc[i]=acc
#Increase only one weight (faster than recalculating the cumulative weight)
else:
weights[ind]+=value
for i in nb.prange(ind,wc.shape[0]):
wc[i]+=value
# Total of weights
m = wc[-1]
# Arrays of sample and sampled indices
sample = np.empty(k, population.dtype)
sample_idx = np.full(k, -1, np.int32)
# Sampling loop
i = 0
while i < k:
# Pick random weight value
r = m * np.random.rand()
# Get corresponding index
idx = np.searchsorted(wc, r, side='right')
# Check index was not selected before
# If not using Numba you can just do `np.isin(idx, sample_idx)`
for j in range(i):
if sample_idx[j] == idx:
continue
# Save sampled value and index
sample[i] = population[idx]
sample_idx[i] = population[idx]
i += 1
return sample
Example
np.random.seed(0)
population = np.random.randint(100, size=1_000_000)
weights = np.random.rand(len(population))
k = 10
wc = np.empty_like(weights)
#Initial calculation
%timeit numba_choice_opt(population, weights, k,wc,True,0,0)
#1.41 ms ± 9.21 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
#Increase weight[100] by 3 and calculate
%timeit numba_choice_opt(population, weights, k,wc,False,100,3)
#213 µs ± 6.06 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
#For comparison
#Please note that it is the memory allcocation of wc which makes
#it so much slower than the initial calculation from above
%timeit numba_choice(population, weights, k)
#4.23 ms ± 64.9 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
You can try something like this. I have accelerated my function with Numba but in my tests it is faster also without that.
import numpy as np
import numba as nb
@nb.njit
def numba_choice(population, weights, k):
# Get cumulative weights
wc = np.cumsum(weights)
# Total of weights
m = wc[-1]
# Arrays of sample and sampled indices
sample = np.empty(k, population.dtype)
sample_idx = np.full(k, -1, np.int32)
# Sampling loop
i = 0
while i < k:
# Pick random weight value
r = m * np.random.rand()
# Get corresponding index
idx = np.searchsorted(wc, r, side='right')
# Check index was not selected before
# If not using Numba you can just do `np.isin(idx, sample_idx)`
for j in range(i):
if sample_idx[j] == idx:
continue
# Save sampled value and index
sample[i] = population[idx]
sample_idx[i] = population[idx]
i += 1
return sample
Here is a quick comparison
def python_choice(population, weights, k):
c = []
while len(c) < 10:
c += random.choices(population=population, weights=weights, k=10 - len(c))
c = list(set(c))
return c
def numpy_choice(population, weights, k):
w = weights / weights.sum()
return np.random.choice(population, size=k, replace=False, p=w)
# Test
np.random.seed(0)
population = np.random.randint(100, size=1_000_000)
weights = np.random.rand(len(population))
k = 10
print(python_choice(population, weights, k))
# [96, 99, 90, 46, 78, 16, 17, 22, 58, 30]
print(numpy_choice(population, weights, k))
# [ 9 61 1 18 41 89 55 4 53 40]
print(numba_choice(population, weights, k))
# [66 82 91 62 9 56 71 14 32 26]
%timeit python_choice(population, weights, k)
# 198 ms ± 19.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit numpy_choice(population, weights, k)
# 13.4 ms ± 65.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit numba_choice(population, weights, k)
# 2.08 ms ± 27.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
EDIT: Here is how it could go without Numba:
import numpy as np
def loop_choice(population, weights, k):
wc = np.cumsum(weights)
m = wc[-1]
sample = np.empty(k, population.dtype)
sample_idx = np.full(k, -1, np.int32)
i = 0
while i < k:
r = m * np.random.rand()
idx = np.searchsorted(wc, r, side='right')
if np.isin(idx, sample_idx):
continue
sample[i] = population[idx]
sample_idx[i] = population[idx]
i += 1
return sample
# Setup from before...
%timeit loop_choice(population, weights, k)
# 3.55 ms ± 23.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
EDIT: Just a small test to check the samples are adjusted to the weights:
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(0)
n = 200
population = np.arange(n)
weights = np.sin(np.linspace(0, 2 * np.pi, n)) + 1
k = 15
r = 1600
a = np.zeros(n, np.int32)
for _ in range(r):
c = numba_choice(population, weights, k)
np.add.at(a, c, 1)
plt.figure()
plt.plot(weights / weights.sum(), label='Weights')
plt.plot(a / (k * r), label='Samples')
plt.legend()
plt.tight_layout()
plt.show()
Result:
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With