I have a (N, T+1) array of weights. Its rows are normalized, meaning
np.allclose(W.sum(axis=1), np.ones(N))
returns True. Now I want to get N samples from np.arange(T+1), where the ith sample is drawn using the ith row of W as its probability vector. I could of course do this with a for loop:
import numpy as np
# Settings
N = 100
T = 20
# Create some normalized weights
W = np.random.rand(N, T+1)
W = W / W.sum(axis=1).reshape(-1, 1)
# Use a for loop to sample
samples = np.zeros(N, dtype=int)
for i in range(N):
    samples[i] = np.random.choice(a=np.arange(T+1), size=1, p=W[i, :])
However, I was wondering if there is already a way to do this in numpy/scipy, or perhaps using some other library. I am hoping for something like this:
# or perhaps a=np.repeat(np.arange(T+1).reshape(-1,1), N, axis=1).T
samples = some_function(a=np.arange(T+1), size=N, p=W)
I am not sure there is built-in support in numpy for this explicitly, but you can draw samples from a uniform distribution and convert them to samples from your categorical distributions using the following:
F = W.cumsum(axis=1) # get the CDF of all distributions
r = np.random.rand(N) # generate N random samples from U(0,1)
samples = np.argmax(F >= r[:, None], axis=1) # convert all these samples to the categorical samples from respective distributions
All operations above are vectorized and samples[i] corresponds to a sample from W[i]. I'll try to explain why and how this works.
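For reference, here is a self-contained version of the above (the seed and the use of np.random.default_rng are my additions, for reproducibility):

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 100, 20

# Row-normalized weights, as in the question
W = rng.random((N, T + 1))
W = W / W.sum(axis=1, keepdims=True)

# Vectorized inverse-CDF sampling: one categorical draw per row
F = W.cumsum(axis=1)                          # (N, T+1) row-wise CDFs
r = rng.random(N)                             # one U(0,1) draw per row
samples = np.argmax(F >= r[:, None], axis=1)  # samples[i] is drawn from W[i]

print(samples.shape)  # (100,)
```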
Assume you have a categorical distribution p(i). To sample from it, you can use a random sample from the uniform distribution U(0,1) and the CDF F(i), which satisfies F[i] - F[i-1] = p[i]:
# Assume p = categorical probability distribution i.e., p[i] = probability of i
F = p.cumsum() # get the CDF
r = np.random.rand() # a random sample from U(0,1)
s = np.argmax(F >= r) # first index where F >= r = a random sample from p - the categorical distribution
This works because s is the first index where F[s] >= r, i.e., F[s-1] < r <= F[s]. Therefore, P(sampling s) = P(F[s-1] < r <= F[s]) = (F[s] - F[s-1])/1 = p[s]. Thus the probability of r landing in this range is exactly the probability of the outcome s.
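Incidentally, for a single distribution the same lookup can also be done with np.searchsorted, which returns the first index with F[s] >= r directly (the example distribution below is my choice):

```python
import numpy as np

rng = np.random.default_rng(1)
p = np.array([0.2, 0.5, 0.3])  # an example categorical distribution
F = p.cumsum()                 # CDF: [0.2, 0.7, 1.0]

r = rng.random()
s = np.searchsorted(F, r)      # first index with F[s] >= r
assert s == np.argmax(F >= r)  # agrees with the argmax approach
```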
In fact, using the vectorized form, you can now generalize the code to generate K samples for each i:
K = 10000
r = np.random.rand(N, K)
samples = np.argmax(F[:, :, None] >= r[:, None, :], axis=1)
Here samples[i] corresponds to K=10000 samples from W[i]. You can also verify that the samples follow the categorical distributions we started from using the following:
for i in range(N):
    w = np.bincount(samples[i], minlength=T+1)/K # get empirical distribution from samples
    error = ((W[i] - w)**2).sum() # squared error between empirical and theoretical distributions
    print(error)
This will print the error between the empirical distributions and the distributions we started with (and the error gets smaller as you increase the number of samples K, which is what we expect to happen).
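To see this convergence concretely, here is a small sketch (the sizes and seed are my choices) that prints the average squared error for increasing K:

```python
import numpy as np

rng = np.random.default_rng(2)
N, T = 5, 3
W = rng.random((N, T + 1))
W = W / W.sum(axis=1, keepdims=True)
F = W.cumsum(axis=1)

errs = []
for K in (100, 10_000, 1_000_000):
    r = rng.random((N, K))
    samples = np.argmax(F[:, :, None] >= r[:, None, :], axis=1)
    # empirical distribution of each row's samples
    emp = np.stack([np.bincount(samples[i], minlength=T + 1) / K for i in range(N)])
    err = ((W - emp) ** 2).sum(axis=1).mean()  # mean squared error over rows
    errs.append(err)
    print(K, err)
```

The error should shrink roughly like 1/K, since each empirical frequency has variance p(1-p)/K.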
Let me know if this helps. Not sure how clear the explanation is though!