In my Python code, I need to run a loop about 25 million times, which I would like to optimize as much as possible. The operations within the loop are very simple. To make the code efficient I have used the numba module, which helps tremendously, but I would like to optimize the code further if possible.
Here is a complete working example:
import numba as nb
import numpy as np
import time
####### create some synthetic data for illustration purposes ##################
size=5000
eps = 0.2
theta_c = 0.4
temp = np.ones(size)
neighbour = np.random.randint(size, size=(size, 3))
coschi = np.random.random_sample((size))
theta = np.random.random_sample((size))*np.pi/2
pwr = np.cos(theta)
###################end of dummy data##########################
###################-----main loop------###############
@nb.jit(fastmath=True)
def func(theta, pwr, neighbour, coschi, temp):
    for k in range(np.argmax(pwr), 5000 * pwr.size):
        n = k % pwr.size
        if np.abs(theta[n] - np.pi / 2.) < np.abs(theta_c):
            adj = neighbour[n, 1]
        else:
            adj = neighbour[n, 0]
        psi_diff = np.abs(np.arccos(coschi[adj]) - np.arccos(coschi[n]))
        temp5 = temp[adj]**5
        e_temp = 1. - np.exp(-temp5 * psi_diff / np.abs(eps))
        temp[n] = temp[adj] + e_temp / temp5 * (pwr[n] - temp[adj]**4)
    return temp
#check time
time1 = time.time()
temp = func(theta, pwr, neighbour, coschi, temp)
print("Took: ", time.time()-time1, " seconds.")
This takes 3.49 seconds on my machine.
I need to run this code several thousand times for model-fitting purposes, so shaving off even one second saves me tens of hours.
What can be done to further optimize this code?
Let me begin with some general comments:
If you use numba and really care about performance you should avoid any possibility that numba creates object-mode code. That means you should use numba.njit(...) or numba.jit(nopython=True, ...) instead of numba.jit(...).
That doesn't make a difference in your case but it makes the intent clearer and throws exceptions as soon as something isn't supported in (the fast) nopython mode.
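To illustrate the difference (a minimal sketch; the fallback behavior of plain @nb.jit shown here applies to older numba versions, current ones default to nopython mode as well):
import numba as nb

@nb.njit
def strict():
    # open() is not supported in nopython mode, so the first call
    # raises a TypingError instead of silently running slowly
    return open("data.txt")

@nb.jit  # older numba: silently falls back to (slow) object mode here
def lenient():
    return open("data.txt")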
You should be careful what you time and how. The first call to a numba-jitted function (that is not compiled ahead of time) will include the compilation cost, so you need to run it once before timing it to get accurate timings. And for more reliable timings you should call the function more than once. I like IPython's %timeit in Jupyter Notebook/Lab to get a rough idea of the performance.
So I will use:
res1 = func(theta, pwr, neighbour, coschi, np.ones(size))
res2 = # other approach
np.testing.assert_allclose(res1, res2)
%timeit func(theta, pwr, neighbour, coschi, np.ones(size))
%timeit # other approach
That way I use the first call (which includes the compilation time) together with an assert to make sure it really produces (almost) the same output, and then I time the function using a more robust method than time.time().
np.arccos

Now let's start with some actual performance optimizations. One obvious one is that you can hoist some "invariants": for example, np.arccos(coschi[...]) is calculated much more often than there are elements in coschi. You iterate over each element in coschi roughly 5000 times, and each iteration does two np.arccos calls! So let's calculate the arccos of coschi once, store it in an intermediate array, and access that inside the loop:
@nb.njit(fastmath=True)
def func2(theta, pwr, neighbour, coschi, temp):
    arccos_coschi = np.arccos(coschi)  # hoisted out of the loop
    for k in range(np.argmax(pwr), 5000 * pwr.size):
        n = k % pwr.size
        if np.abs(theta[n] - np.pi / 2.) < np.abs(theta_c):
            adj = neighbour[n, 1]
        else:
            adj = neighbour[n, 0]
        psi_diff = np.abs(arccos_coschi[adj] - arccos_coschi[n])
        temp5 = temp[adj]**5
        e_temp = 1. - np.exp(-temp5 * psi_diff / np.abs(eps))
        temp[n] = temp[adj] + e_temp / temp5 * (pwr[n] - temp[adj]**4)
    return temp
On my computer that's significantly faster already:
1.73 s ± 54.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) # original
811 ms ± 49.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) # func2
However it comes at a price: the results will be different! I consistently get significantly different results between the original and the hoisted version with fastmath=True, while the results are (almost) equal with fastmath=False. It seems that fastmath enables some aggressive optimizations for np.arccos(coschi[adj]) - np.arccos(coschi[n]) that are not possible once the np.arccos is hoisted. My personal opinion: drop fastmath=True if you care about exact results, or at least verify that the accuracy of the results isn't significantly affected by it!
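If you want to quantify that effect yourself, here is a minimal sketch: numba dispatchers expose the original Python function as .py_func, so the identical code can be recompiled without fastmath and the outputs compared directly:
# Recompile the identical function body with fastmath disabled ...
func2_exact = nb.njit(fastmath=False)(func2.py_func)

res_fast = func2(theta, pwr, neighbour, coschi, np.ones(size))
res_exact = func2_exact(theta, pwr, neighbour, coschi, np.ones(size))

# ... and inspect the worst-case absolute deviation introduced by fastmath.
print(np.max(np.abs(res_fast - res_exact)))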
adj

The next thing to hoist is adj; it's also calculated much more often than necessary:
@nb.njit(fastmath=True)
def func3(theta, pwr, neighbour, coschi, temp):
    arccos_coschi = np.arccos(coschi)
    # precompute which neighbour is relevant for each element
    associated_neighbour = np.empty(neighbour.shape[0], nb.int64)
    for idx in range(neighbour.shape[0]):
        if np.abs(theta[idx] - np.pi / 2.) < np.abs(theta_c):
            associated_neighbour[idx] = neighbour[idx, 1]
        else:
            associated_neighbour[idx] = neighbour[idx, 0]
    for k in range(np.argmax(pwr), 5000 * pwr.size):
        n = k % pwr.size
        adj = associated_neighbour[n]
        psi_diff = np.abs(arccos_coschi[adj] - arccos_coschi[n])
        temp5 = temp[adj]**5
        e_temp = 1. - np.exp(-temp5 * psi_diff / np.abs(eps))
        temp[n] = temp[adj] + e_temp / temp5 * (pwr[n] - temp[adj]**4)
    return temp
The effect of this isn't that big, but measurable:
1.75 s ± 110 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) # original
761 ms ± 28.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) # func2
660 ms ± 8.42 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) # func3
Hoisting additional calculations seemed to have no effect on the performance on my computer, so I don't include them here. This seems to be as far as you can get without changing the algorithm.
However, I would recommend separating the hoisting into other functions and passing all variables as function parameters instead of looking up globals. That probably won't result in speed-ups, but it can make the code more readable:
@nb.njit
def func4_inner(indices, pwr, associated_neighbour, arccos_coschi, temp, abs_eps):
    for n in indices:
        adj = associated_neighbour[n]
        psi_diff = np.abs(arccos_coschi[adj] - arccos_coschi[n])
        temp5 = temp[adj]**5
        e_temp = 1. - np.exp(-temp5 * psi_diff / abs_eps)
        temp[n] = temp[adj] + e_temp / temp5 * (pwr[n] - temp[adj]**4)
    return temp

@nb.njit
def get_relevant_neighbor(neighbour, abs_theta_minus_pi_half, abs_theta_c):
    associated_neighbour = np.empty(neighbour.shape[0], nb.int64)
    for idx in range(neighbour.shape[0]):
        if abs_theta_minus_pi_half[idx] < abs_theta_c:
            associated_neighbour[idx] = neighbour[idx, 1]
        else:
            associated_neighbour[idx] = neighbour[idx, 0]
    return associated_neighbour

def func4(theta, pwr, neighbour, coschi, temp, theta_c, eps):
    # hoisted computations done with plain NumPy outside of numba
    arccos_coschi = np.arccos(coschi)
    abs_theta_minus_pi_half = np.abs(theta - (np.pi / 2.))
    relevant_neighbors = get_relevant_neighbor(neighbour, abs_theta_minus_pi_half, abs(theta_c))
    argmax_pwr = np.argmax(pwr)
    indices = np.tile(np.arange(pwr.size), 5000)[argmax_pwr:]
    return func4_inner(indices, pwr, relevant_neighbors, arccos_coschi, temp, abs(eps))
Here I also did some additional changes:
- np.tile and slicing instead of the range approach with %, so the inner function receives a precomputed index array (a quick sanity check of this construction follows after the timings below).
- Plain NumPy (outside of numba) for the np.arccos and the other hoisted computations.

1.79 s ± 49.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) # original
844 ms ± 41.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) # func2
707 ms ± 31.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) # func3
550 ms ± 4.88 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) # func4
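As a quick sanity check (an illustrative sketch with tiny dummy values), the precomputed index array visits exactly the same elements, in the same order, as the original range plus % construction:
n_elems, start, reps = 7, 3, 4  # tiny dummy values for illustration

from_range = [k % n_elems for k in range(start, reps * n_elems)]
from_tile = np.tile(np.arange(n_elems), reps)[start:]

assert list(from_tile) == from_range  # same indices, same order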
So in the end the latest approach is roughly 3 times faster (without fastmath) than the original approach. If you're sure you want to use fastmath, just apply fastmath=True to func4_inner and it will be even faster:
499 ms ± 4.47 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) # func4 with fastmath on func4_inner
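For reference, that speed-up needs nothing more than a different decorator on the inner kernel (a sketch; the body is unchanged from func4_inner above):
@nb.njit(fastmath=True)  # fastmath only on the hot inner loop
def func4_inner(indices, pwr, associated_neighbour, arccos_coschi, temp, abs_eps):
    ...  # body identical to func4_inner above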
However, as I have already stated, fastmath may not be appropriate if you want exact (or at least not too inexact) results.
Also several optimizations here depend a lot on the available hardware and processor caches (especially for memory-bandwidth-limited portions of the code). You have to check how these approaches perform relative to each other on your computer.
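If you are not working in IPython/Jupyter where %timeit is available, a rough stand-in can be built from the standard library (a sketch; the warm-up call keeps the compilation time out of the measurement, and the temp array is reused and mutated across repeats, which is fine for timing purposes):
import timeit

def bench(f, *args, repeat=7):
    f(*args)  # warm-up call: triggers (and absorbs) the JIT compilation
    times = timeit.repeat(lambda: f(*args), number=1, repeat=repeat)
    print(f"{f.__name__}: best of {repeat} runs: {min(times):.3f} s")

bench(func, theta, pwr, neighbour, coschi, np.ones(size))
bench(func4, theta, pwr, neighbour, coschi, np.ones(size), theta_c, eps)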