I have a (N, T+1) array of weights. Its rows are normalized, meaning
np.allclose(W.sum(axis=1), np.ones(N))
returns True. Now I want to get N samples from np.arange(T+1), where the ith sample is drawn using the ith row of W as its probability vector. I could of course do this with a for loop:
import numpy as np
# Settings
N = 100
T = 20
# Create some normalized weights
W = np.random.rand(N, T+1)
W = W / W.sum(axis=1).reshape(-1, 1)
# Use a for loop to sample
samples = np.zeros(N, dtype=int)
for i in range(N):
    samples[i] = np.random.choice(a=np.arange(T+1), size=1, p=W[i, :])
However, I was wondering if there is already a way to do this in numpy/scipy, or perhaps using some other library. I am hoping for something like this:
# or perhaps a=np.repeat(np.arange(T+1).reshape(-1,1), N, axis=1).T
samples = some_function(a=np.arange(T+1), size=N, p=W)
I am not sure there is built-in support in numpy for this explicitly, but you can draw samples from a uniform distribution and convert them to samples from your categorical distributions using the following:
F = W.cumsum(axis=1) # get the CDF of all distributions
r = np.random.rand(N) # generate N random samples from U(0,1)
samples = np.argmax(F >= r[:, None], axis=1) # convert all these samples to the categorical samples from respective distributions
All operations above are vectorized and samples[i] corresponds to a sample from W[i]. I'll try to explain why and how this works.
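For reference, here is a self-contained version of the above (the seed and the use of np.random.default_rng are my additions, for reproducibility):

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 100, 20

# Row-normalized weights, as in the question
W = rng.random((N, T + 1))
W = W / W.sum(axis=1, keepdims=True)

# Vectorized inverse-CDF sampling: one categorical draw per row
F = W.cumsum(axis=1)                          # (N, T+1) row-wise CDFs
r = rng.random(N)                             # one U(0,1) draw per row
samples = np.argmax(F >= r[:, None], axis=1)  # samples[i] is drawn from W[i]

print(samples.shape)  # (100,)
```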
Assume you have a categorical distribution p(i). To sample from it, you can use a random sample from the uniform distribution U(0,1) and the CDF F(i), which satisfies F[i] - F[i-1] = p[i]:
# Assume p = categorical probability distribution i.e., p[i] = probability of i
F = p.cumsum() # get the CDF
r = np.random.rand() # a random sample from U(0,1)
s = np.argmax(F >= r) # first index where F >= r = a random sample from p - the categorical distribution
This works because s is the first index where F[s] >= r, i.e., F[s-1] < r <= F[s]. Therefore, P(sampling s) = P(F[s-1] < r <= F[s]) = (F[s] - F[s-1])/1 = p[s]. Thus the probability of r landing in this range is exactly the probability of the outcome s.
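Incidentally, for a single distribution the same lookup can also be done with np.searchsorted, which returns the first index with F[s] >= r directly (the example distribution below is my choice):

```python
import numpy as np

rng = np.random.default_rng(1)
p = np.array([0.2, 0.5, 0.3])  # an example categorical distribution
F = p.cumsum()                 # CDF: [0.2, 0.7, 1.0]

r = rng.random()
s = np.searchsorted(F, r)      # first index with F[s] >= r
assert s == np.argmax(F >= r)  # agrees with the argmax approach
```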
In fact, using the vectorized form, you can now generalize the code to generate K samples for each i:
K = 10000
r = np.random.rand(N, K)
samples = np.argmax(F[:, :, None] >= r[:, None, :], axis=1)
Here samples[i] corresponds to K=10000 samples from W[i]. You can also verify that the samples follow the categorical distributions we started from using the following:
for i in range(N):
    w = np.bincount(samples[i], minlength=T+1)/K # get empirical distribution from samples
    error = ((W[i] - w)**2).sum() # squared error between empirical and theoretical distributions
    print(error)
This will print the error between the empirical distributions and the distributions we started with (and the error gets smaller as you increase the number of samples K, which is what we expect to happen).
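To see this convergence concretely, here is a small sketch (the sizes and seed are my choices) that prints the average squared error for increasing K:

```python
import numpy as np

rng = np.random.default_rng(2)
N, T = 5, 3
W = rng.random((N, T + 1))
W = W / W.sum(axis=1, keepdims=True)
F = W.cumsum(axis=1)

errs = []
for K in (100, 10_000, 1_000_000):
    r = rng.random((N, K))
    samples = np.argmax(F[:, :, None] >= r[:, None, :], axis=1)
    # empirical distribution of each row's samples
    emp = np.stack([np.bincount(samples[i], minlength=T + 1) / K for i in range(N)])
    err = ((W - emp) ** 2).sum(axis=1).mean()  # mean squared error over rows
    errs.append(err)
    print(K, err)
```

The error should shrink roughly like 1/K, since each empirical frequency has variance p(1-p)/K.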
Let me know if this helps. Not sure how clear the explanation is though!