I'm looking for a away to simulate a classical frequency severity distribution, something like: X = sum(i = 1..N, Y_i), where N is for example poisson distributed and Y lognormal.
Simple naive numpy script would be:
import numpy as np
SIM_NUM = 3
X = []
for _i in range(SIM_NUM):
nr_claims = np.random.poisson(1)
temp = []
for _j in range(nr_claims):
temp.append(np.random.lognormal(0, 1))
X.append(sum(temp))
Now I try to vectorize that for a performance increase. A bit better is the following:
N = np.random.poisson(1, SIM_NUM)
X = []
for n in N:
X.append(sum(np.random.lognormal(0, 1, n)))
My question is if I can still vectorize the second loop? For example by simulating all the losses with:
N = np.random.poisson(1, SIM_NUM)
# print(N) would lead to something like [1 3 0]
losses = np.random.lognormal(0,1, sum(N))
# print(N) would lead to something like
#[ 0.56750244 0.84161871 0.41567216 1.04311742]
# X should now be [ 0.56750244, 0.84161871 + 0.41567216 + 1.04311742, 0]
Ideas that I have are "smart slicing" or "matrix multiplication with A = [[1, 0, 0, 0]],[0,1,1,1],[0,0,0,0]]. But I couldn't make something clever out these ideas.
I'm looking for fastest possible computation of X.
We can use np.bincount , which is quite efficient for such interval/ID based summing operations specially when working with 1D arrays. The implementation would look something like this -
# Generate all poisson distribution values in one go
pv = np.random.poisson(1,SIM_NUM)
# Use poisson values to get count of total for random lognormal needed.
# Generate all those random numbers again in vectorized way
rand_arr = np.random.lognormal(0, 1, pv.sum())
# Finally create IDs using pv as extents for use with bincount to do
# ID based and thus effectively interval-based summing
out = np.bincount(np.arange(pv.size).repeat(pv),rand_arr,minlength=SIM_NUM)
Runtime test -
Function definitions :
def original_app1(SIM_NUM):
X = []
for _i in range(SIM_NUM):
nr_claims = np.random.poisson(1)
temp = []
for _j in range(nr_claims):
temp.append(np.random.lognormal(0, 1))
X.append(sum(temp))
return X
def original_app2(SIM_NUM):
N = np.random.poisson(1, SIM_NUM)
X = []
for n in N:
X.append(sum(np.random.lognormal(0, 1, n)))
return X
def vectorized_app1(SIM_NUM):
pv = np.random.poisson(1,SIM_NUM)
r = np.random.lognormal(0, 1,pv.sum())
return np.bincount(np.arange(pv.size).repeat(pv),r,minlength=SIM_NUM)
Timings on large datasets :
In [199]: SIM_NUM = 1000
In [200]: %timeit original_app1(SIM_NUM)
100 loops, best of 3: 2.6 ms per loop
In [201]: %timeit original_app2(SIM_NUM)
100 loops, best of 3: 6.65 ms per loop
In [202]: %timeit vectorized_app1(SIM_NUM)
1000 loops, best of 3: 252 µs per loop
In [203]: SIM_NUM = 10000
In [204]: %timeit original_app1(SIM_NUM)
10 loops, best of 3: 26.1 ms per loop
In [205]: %timeit original_app2(SIM_NUM)
10 loops, best of 3: 77.5 ms per loop
In [206]: %timeit vectorized_app1(SIM_NUM)
100 loops, best of 3: 2.46 ms per loop
So, we are looking at some 10x+ speedup there.
You're looking for numpy.add.reduceat:
N = np.random.poisson(1, SIM_NUM)
losses = np.random.lognormal(0,1, np.sum(N))
x = np.zeros(SIM_NUM)
offsets = np.r_[0, np.cumsum(N[N>0])]
x[N>0] = np.add.reduceat(losses, offsets[:-1])
The case where n == 0 is handled separately, because of how reduceat works. Also, be sure to use numpy.sum on arrays instead of the much slower Python sum.
If this is faster than the other answer depends on the mean of your Poisson distribution.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With