In my code I have for loop that indexes over a multidimensional numpy array and does some operation using the sub-array that is obtained at each iteration. It looks like this
for sub in Arr:
#do stuff using sub
Now the stuff that is done using sub
is fully vectorized, so it should be efficient. On the other hand this loop iterates about ~10^5
times and is the bottleneck. Do you think I will get an improvement by offloading this part to C. I am somewhat reluctant to do so because the do stuff using sub
uses broadcasting, slicing, smart-indexing tricks that would be tedious to write in plain C. I would also welcome thoughts and suggestions about how to deal with broadcasting, slicing, smart-indexing when offloading computation to C.
If you can't 'vectorize' the entire operation and looping is indeed the bottleneck, then I highly recommend using Cython. I've been dabbling around with it recently and it is straightforward to work with and has a decent interface with numpy. For something like a langevin integrator I saw a 115x speedup over a decent implementation in numpy. See the documentation here:
http://docs.cython.org/src/tutorial/numpy.html
and I also recommend looking at the following paper
You may see satisfactory speedups by just typing the input array and the loop counter, but if you want to leverage the full potential of cython, then you are going to have to hardcode the equivalent broadcasting.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With