I have a 2D array, say, a = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], ... [21, 22, 23, 24]], and I would like to pick N elements from each row at random, according to a probability distribution p that can be different for each row.
So basically, I'd like to do something like [np.random.choice(a[i], N, p=p_arr[i]) for i in range(a.shape[0])] without using a loop, where p_arr is a 2D array of the same shape as a that stores the probability distribution for each row.
The reason I want to avoid a for loop is that a line profiler shows the loop is slowing my code down considerably (I work with large arrays).
Is there a more Pythonic way of doing this?
I checked out these links (here and here) but they don't answer my question.
Thank you!
An example of what I'd like to do without the loop:
>>> import numpy as np
>>> a = np.ones([500, 500])
>>> p_arr = np.identity(a.shape[0])
>>> for i in range(a.shape[0]):
...     a[i] = a[i] * np.arange(a.shape[0])
...
>>> [print(np.random.choice(a[i], p=p_arr[i])) for i in range(a.shape[0])]
A list comprehension (still a Python-level loop, but with less per-iteration overhead) may be enough to address the issue:
import numpy as np

shape = (10, 10)
N = 4

# one probability distribution per row, normalized to sum to 1
distributions = np.random.rand(*shape)
distributions = distributions / np.sum(distributions, axis=1)[:, None]

# values to sample from: row i holds 10*i .. 10*i + 9
values = np.arange(shape[0] * shape[1]).reshape(shape)

# draw N values from each row according to that row's distribution
sample = np.array([np.random.choice(v, N, p=r) for v, r in zip(values, distributions)])
output:
print(np.round(distributions,2))
[[0.03 0.22 0.1 0.09 0.2 0.1 0.11 0.05 0.08 0.01]
[0.04 0.12 0.13 0.03 0.16 0.22 0.16 0.05 0. 0.09]
[0.15 0.04 0.08 0.07 0.17 0.13 0.01 0.15 0.1 0.1 ]
[0.06 0.13 0.16 0.03 0.17 0.09 0.08 0.11 0.05 0.12]
[0.07 0.08 0.09 0.08 0.13 0.18 0.12 0.13 0.07 0.07]
[0.1 0.04 0.11 0.06 0.04 0.16 0.18 0.15 0.01 0.15]
[0.06 0.09 0.17 0.08 0.14 0.15 0.09 0.01 0.06 0.15]
[0.03 0.1 0.11 0.07 0.14 0.14 0.15 0.1 0.04 0.11]
[0.05 0.1 0.18 0.1 0.03 0.18 0.12 0.05 0.05 0.13]
[0.13 0.1 0.08 0.11 0.06 0.14 0.11 0. 0.14 0.14]]
print(sample)
[[ 6 4 8 5]
[16 19 15 10]
[25 20 24 23]
[37 34 30 31]
[41 44 46 45]
[59 55 53 57]
[64 63 65 61]
[79 75 76 77]
[85 81 83 88]
[99 96 93 90]]
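The comprehension above still loops in Python. For sampling with replacement, the per-row draws can be fully vectorized with inverse-CDF sampling: build each row's cumulative distribution and locate uniform draws in it via broadcasting. A sketch (the variable names and the clipping guard are mine, not from the answer above):

```python
import numpy as np

rng = np.random.default_rng(0)
shape = (10, 10)
N = 4

# same setup as above: one normalized distribution per row
distributions = rng.random(shape)
distributions /= distributions.sum(axis=1, keepdims=True)
values = np.arange(shape[0] * shape[1]).reshape(shape)

# inverse-CDF sampling, vectorized over rows:
# count how many CDF entries each uniform draw exceeds -> column index
cdf = np.cumsum(distributions, axis=1)
u = rng.random((shape[0], N))
idx = (u[..., None] > cdf[:, None, :]).sum(axis=2)
# guard against floating-point rounding in the last CDF entry
idx = np.clip(idx, 0, shape[1] - 1)
sample = np.take_along_axis(values, idx, axis=1)
```

This replaces the Python-level loop with a handful of array operations, at the cost of an intermediate (rows, N, columns) boolean array from the broadcasted comparison.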
If you want non-repeating samples within each row, there is another kind of optimization you can try. By flattening the values and the distributions, you can draw a non-repeating shuffle of the indexes of the whole matrix according to the combined distributions. Within the flattened distribution, the values that belong to the same row keep, as a group, a distribution equivalent to that row's. This means that if you reassemble the shuffled indexes onto their original rows while keeping their shuffled order stable, you can take a slice of the shuffled matrix to obtain your sample:
# flatten the per-row distributions into one global distribution
flatDist = distributions.reshape((distributions.size,))
flatDist = flatDist / np.sum(flatDist)

# shuffle all matrix indexes without replacement, weighted by flatDist
randomIdx = np.random.choice(np.arange(values.size), flatDist.size, replace=False, p=flatDist)

# convert flat indexes back to (row, column) pairs
shuffleIdx = np.array([randomIdx // shape[1], randomIdx % shape[1]])

# regroup by row while preserving the shuffled order within each row
shuffleIdx = shuffleIdx[:, np.argsort(shuffleIdx[0, :], kind="stable")]

# the first N columns of each reassembled row are the sample
sample = values[tuple(shuffleIdx)].reshape(shape)[:, :N]
output:
print(sample)
[[ 3 7 2 5]
[13 12 14 16]
[27 23 25 29]
[37 31 33 36]
[47 45 48 49]
[59 50 52 54]
[62 61 60 66]
[72 78 70 77]
[87 82 83 86]
[92 98 95 93]]
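Another fully vectorized option for the no-replacement case is the Gumbel-top-k trick: perturb each row's log-probabilities with independent Gumbel noise and keep the N largest keys per row, which is equivalent to drawing N times from that row's distribution without replacement. A hedged sketch (this technique is my addition, not part of the answer above):

```python
import numpy as np

rng = np.random.default_rng(0)
shape = (10, 10)
N = 4

# same setup as above: one normalized distribution per row
distributions = rng.random(shape)
distributions /= distributions.sum(axis=1, keepdims=True)
values = np.arange(shape[0] * shape[1]).reshape(shape)

# Gumbel-top-k: the N largest values of log(p) + Gumbel noise give a
# sample without replacement distributed as successive draws from p
keys = np.log(distributions) + rng.gumbel(size=shape)
topN = np.argsort(-keys, axis=1)[:, :N]  # column indices of the N largest keys per row
sample = np.take_along_axis(values, topN, axis=1)
```

Unlike the flattening approach, this samples each row independently and needs no global shuffle, so it scales to large matrices with only an argsort per row (np.argpartition could replace the argsort if the within-sample order does not matter).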