I am writing a simulation of a wireless network in Python using numpy and Cython. Suppose there are no_nodes transmitting nodes scattered randomly on the 2D plane, each sending out a waveform, and their respective receivers, again scattered randomly on the 2D plane. Each transmitting node produces a waveform I call output
(each one can produce an output of a different length).
What I want to do is sum these outputs from all transmitters into one big waveform per receiver, which will be the input to that receiver for demodulation etc. Now two key points:
1. a start_clock and an end_clock have to be maintained per transmitting node, so that the waveforms are summed with the proper time offsets.
2. the output of transmitting node j will be attenuated before being received by node i, according to a function attenuate(i, j).
So here is the code:
# create empty 2d array (no_rx_nodes x no_samples for each waveform)
waveforms = np.zeros((no_nodes, max(end_clock)))
for i in range(no_nodes):      # calculate the waveform for each receiver
    for j in range(no_nodes):  # sum the waveforms produced by each transmitter
        waveforms[i, start_clock[j]:end_clock[j]] += output[j, :] * attenuate(i, j)
return waveforms
Some comments on the above:
output[j, :] is the output waveform of transmitter j.
waveforms[i, :] is the waveform received by receiver i.

I hope it is fairly clear what I am trying to accomplish here. Because the waveforms produced are very large (about 10^6 samples each), I also tried turning this code into Cython, but without a dramatic speedup (maybe 5-10x better, but no more). I was wondering if there is anything else I can resort to in order to get a further speedup, because this is a real bottleneck in the whole simulation (it takes almost as much time to compute as the rest of the code, which is in fact quite a bit more complicated than this).
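For concreteness, here is a minimal self-contained sketch of the kind of inputs the code above works on. The node positions, the toy attenuation model and the random clocks are placeholders for illustration only; in my actual code the outputs are produced by the modulators and may be stored differently, since each waveform has a different length:

import numpy as np

no_nodes = 4
rng = np.random.default_rng(0)

# hypothetical per-transmitter clocks; each output has a different length
start_clock = rng.integers(0, 100, size=no_nodes)
lengths = rng.integers(500, 1000, size=no_nodes)
end_clock = start_clock + lengths
output = [rng.standard_normal(n) for n in lengths]   # one waveform per transmitter

# placeholder geometry and attenuation model, purely for illustration
tx_pos = rng.uniform(0, 1000, size=(no_nodes, 2))
rx_pos = rng.uniform(0, 1000, size=(no_nodes, 2))

def attenuate(i, j):
    # toy path-loss between receiver i and transmitter j
    d = np.linalg.norm(rx_pos[i] - tx_pos[j])
    return 1.0 / (1.0 + d**2)

waveforms = np.zeros((no_nodes, end_clock.max()))
for i in range(no_nodes):
    for j in range(no_nodes):
        waveforms[i, start_clock[j]:end_clock[j]] += output[j] * attenuate(i, j)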
This is a memory-bandwidth-bound problem: with about 3 GB/s of memory bandwidth, the best you can get out of this is around 2-4 ms for the inner loop. To reach that bound you need to block your inner loop to utilize the CPU caches better (numexpr does this for you):
for i in range(no_nodes):
    for j in range(no_nodes):
        # s should be chosen so all operands fit in the (next-to-)last level cache;
        # the first level is normally too small to be usable due to python overhead
        s = 15000
        a = attenuation[i, j]  # attenuation factors precomputed into a matrix
        o = output[j]
        w = waveforms[i]
        for k in range(0, w.size, s):
            u = min(k + s, w.size)
            w[k:u] += o[k:u] * a
        # or, instead of the blocked k loop: numexpr.evaluate("w + o * a", out=w)
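For reference, the numexpr variant mentioned in the comment could look roughly like this (a sketch, assuming the attenuation factors have been precomputed into an array attenuation and each output[j] has been padded to the length of waveforms[i]). numexpr chunks the evaluation internally to fit the cache, so the manual blocking loop disappears:

import numexpr as ne

for i in range(no_nodes):
    w = waveforms[i]
    for j in range(no_nodes):
        a = attenuation[i, j]   # precomputed attenuation factor (assumption)
        o = output[j]           # padded to w.size (assumption)
        # evaluates w + o * a in cache-sized chunks and writes the result back into w
        ne.evaluate("w + o * a", out=w)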
Using float32 data instead of float64 should also halve the memory bandwidth requirements and roughly double the performance.
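If single precision is acceptable for the simulation, this is just a matter of dtypes. A sketch, using the array names from the question and assuming output is stored as a list of per-transmitter arrays:

# allocate the accumulator and convert the inputs to single precision
waveforms = np.zeros((no_nodes, max(end_clock)), dtype=np.float32)
output = [np.asarray(o, dtype=np.float32) for o in output]
attenuation = attenuation.astype(np.float32)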
To get larger speedups you have to redesign your full algorithm to have better data locality.
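As one illustration of what such a redesign could look like (a sketch only, assuming all outputs are padded to a common length in a 2D array and the attenuation factors are precomputed): tile over the sample axis first, so that each chunk of every transmitter's output is loaded into cache once and reused for all receivers, instead of being streamed from main memory once per receiver.

N = waveforms.shape[1]
s = 15000  # chunk length, chosen so no_nodes output chunks fit in cache
for k in range(0, N, s):
    u = min(k + s, N)
    for i in range(no_nodes):
        w = waveforms[i, k:u]
        for j in range(no_nodes):
            # output[j, k:u] stays cache-hot across all receivers i
            w += output[j, k:u] * attenuation[i, j]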