I am writing a simulation of a wireless network in Python using numpy and Cython. Suppose there are no_nodes transmitting nodes scattered randomly on the 2D plane, each sending out a waveform, and their respective receivers, again scattered randomly on the 2D plane. Each transmitting node produces a waveform I call output
(each one can produce an output of a different length).
What I want to do is sum these outputs from all transmitters into one big waveform per receiver, which will be the input to that receiver for demodulation etc. Now two key points:
1. a start_clock and an end_clock have to be maintained per transmitting node, so that the waveforms are summed with the proper time offsets.
2. the output of transmitting node j will be attenuated before being received by node i, according to a function attenuate(i, j).
So here is the code:
# create empty 2d array (no_rx_nodes x no_samples for each waveform)
waveforms = np.zeros((no_nodes, max(end_clock)))
for i in range(no_nodes):      # calculate the waveform for each receiver
    for j in range(no_nodes):  # sum the waveforms produced by each transmitter
        waveforms[i, start_clock[j]:end_clock[j]] += output[j, :] * attenuate(i, j)
return waveforms
Some comments on the above:
output[j, :] is the output waveform of transmitter j.
waveforms[i, :] is the waveform received by receiver i.

I hope it is fairly clear what I am trying to accomplish here. Because the waveforms produced are very large (about 10^6 samples each), I also tried turning this code into Cython, but without a dramatic speedup (maybe 5-10x better, but no more). I was wondering if there is anything else I can resort to in order to get a further speedup, because this is a real bottleneck in the whole simulation (it takes almost as much time to compute as the rest of the code, which is in fact quite a bit more complicated than this).
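For concreteness, here is a minimal self-contained sketch of the kind of inputs the code above works on. The node positions, the toy attenuation model and the random clocks are placeholders for illustration only; in my actual code the outputs are produced by the modulators and may be stored differently, since each waveform has a different length:

import numpy as np

no_nodes = 4
rng = np.random.default_rng(0)

# hypothetical per-transmitter clocks; each output has a different length
start_clock = rng.integers(0, 100, size=no_nodes)
lengths = rng.integers(500, 1000, size=no_nodes)
end_clock = start_clock + lengths
output = [rng.standard_normal(n) for n in lengths]   # one waveform per transmitter

# placeholder geometry and attenuation model, purely for illustration
tx_pos = rng.uniform(0, 1000, size=(no_nodes, 2))
rx_pos = rng.uniform(0, 1000, size=(no_nodes, 2))

def attenuate(i, j):
    # toy path-loss between receiver i and transmitter j
    d = np.linalg.norm(rx_pos[i] - tx_pos[j])
    return 1.0 / (1.0 + d**2)

waveforms = np.zeros((no_nodes, end_clock.max()))
for i in range(no_nodes):
    for j in range(no_nodes):
        waveforms[i, start_clock[j]:end_clock[j]] += output[j] * attenuate(i, j)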
This is a memory-bandwidth-bound problem: with about 3 GB/s of memory bandwidth, the best you can get out of this is around 2-4 ms for the inner loop. To reach that bound you need to block your inner loop to utilize the CPU caches better (numexpr does this for you):
for i in range(no_nodes):
    for j in range(no_nodes):
        # s should be chosen so all operands fit in the (next-to-)last level cache;
        # the first level is normally too small to be usable due to python overhead
        s = 15000
        a = attenuation[i, j]  # attenuation factors precomputed into a matrix
        o = output[j]
        w = waveforms[i]
        for k in range(0, w.size, s):
            u = min(k + s, w.size)
            w[k:u] += o[k:u] * a
        # or, instead of the blocked k loop: numexpr.evaluate("w + o * a", out=w)
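For reference, the numexpr variant mentioned in the comment could look roughly like this (a sketch, assuming the attenuation factors have been precomputed into an array attenuation and each output[j] has been padded to the length of waveforms[i]). numexpr chunks the evaluation internally to fit the cache, so the manual blocking loop disappears:

import numexpr as ne

for i in range(no_nodes):
    w = waveforms[i]
    for j in range(no_nodes):
        a = attenuation[i, j]   # precomputed attenuation factor (assumption)
        o = output[j]           # padded to w.size (assumption)
        # evaluates w + o * a in cache-sized chunks and writes the result back into w
        ne.evaluate("w + o * a", out=w)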
Using float32 data instead of float64 should also halve the memory bandwidth requirements and roughly double the performance.
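If single precision is acceptable for the simulation, this is just a matter of dtypes. A sketch, using the array names from the question and assuming output is stored as a list of per-transmitter arrays:

# allocate the accumulator and convert the inputs to single precision
waveforms = np.zeros((no_nodes, max(end_clock)), dtype=np.float32)
output = [np.asarray(o, dtype=np.float32) for o in output]
attenuation = attenuation.astype(np.float32)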
To get larger speedups you have to redesign your full algorithm to have better data locality.
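As one illustration of what such a redesign could look like (a sketch only, assuming all outputs are padded to a common length in a 2D array and the attenuation factors are precomputed): tile over the sample axis first, so that each chunk of every transmitter's output is loaded into cache once and reused for all receivers, instead of being streamed from main memory once per receiver.

N = waveforms.shape[1]
s = 15000  # chunk length, chosen so no_nodes output chunks fit in cache
for k in range(0, N, s):
    u = min(k + s, N)
    for i in range(no_nodes):
        w = waveforms[i, k:u]
        for j in range(no_nodes):
            # output[j, k:u] stays cache-hot across all receivers i
            w += output[j, k:u] * attenuation[i, j]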