I've read a lot of advice like "avoid for loops with numpy", so I tried to follow it. I was using this code (a simplified version). Some auxiliary data:
In[1]: import numpy as np
resolution = 1000 # this parameter varies
tim = np.linspace(-np.pi, np.pi, resolution)
prec = np.arange(1, resolution + 1)
prec = 2 * prec - 1
values = np.zeros_like(tim)
My first implementation was with a for loop:
In[2]: for i, ti in enumerate(tim):
           values[i] = np.sum(np.sin(prec * ti))
Then I got rid of the explicit for loop and arrived at this:
In[3]: values = np.sum(np.sin(tim[:, np.newaxis] * prec), axis=1)
And this solution was faster for small arrays, but when I scaled up I got the following time dependence: [plot of runtime vs. resolution for both solutions, omitted here]
What am I missing, or is this normal behavior? And if it's not, where should I dig?
EDIT: Following the comments, here is some additional information. The times were measured with IPython's %timeit and %%timeit, and every run was performed on a fresh kernel. My laptop is an Acer Aspire V7-482PG (i7, 8 GB).
Short Answer: Yes. Numpy uses C and Fortran for the expensive computations, and this is what makes it so fast. Long Answer: Loops and function calls are dog slow in dynamic languages; if JITed, they are merely slow (except under LuaJIT 2, I hear).
This vectorized approach is designed to push the loop into the compiled layer that underlies NumPy, leading to much faster execution. Vectorized operations in NumPy are implemented via ufuncs, whose main purpose is to quickly execute repeated operations on values in NumPy arrays.
Because a Numpy array is densely packed in memory thanks to its homogeneous type, it also frees memory faster. Overall, a task executed with Numpy is around 5 to 100 times faster than with a standard Python list, which is a significant leap in speed.
NumPy is fast because it can do all its calculations without calling back into Python. A function that loops in Python loses all the performance benefits of using NumPy; for a 10,000,000-entry NumPy array, one such function took 2.5 seconds to run on my computer.
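To make these fragments concrete, here is a minimal sketch (the helper name python_loop is mine, and exact timings depend on the machine) contrasting a Python-level loop with a single NumPy call on a 10,000,000-entry array:

import numpy as np
import timeit

big = np.arange(10_000_000, dtype=np.float64)

def python_loop(arr):
    # every iteration boxes one element into a Python object
    # and dispatches '+' through the interpreter
    total = 0.0
    for x in arr:
        total += x
    return total

print(timeit.timeit(lambda: python_loop(big), number=1))  # on the order of seconds
print(timeit.timeit(lambda: np.sum(big), number=1))       # on the order of milliseconds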
This is normal and expected behaviour. The "avoid for loops with numpy" statement is just too simplified to apply everywhere. If you're dealing with inner loops it's (almost) always true. But in the case of outer loops (like yours) there are far more exceptions, especially when the alternative is broadcasting, which buys its speed by using a lot more memory.
Just to add a bit of background to that "avoid for loops with numpy" statement:
NumPy arrays are stored as contiguous arrays of C types. The Python int is not the same as a C int! So whenever you iterate over each item in an array you need to pull the item from the array and convert it to a Python object, then do whatever you want with it, and finally you may need to convert it back to a C integer (this is called boxing and unboxing the value). For example, suppose you want to sum the items in an array using Python:
import numpy as np
arr = np.arange(1000)
%%timeit
acc = 0
for item in arr:
    acc += item
# 1000 loops, best of 3: 478 µs per loop
You'd better use numpy:
%timeit np.sum(arr)
# 10000 loops, best of 3: 24.2 µs per loop
Even if you push the loop into Python's C code (by using the built-in sum) you're still far away from numpy's performance:
%timeit sum(arr)
# 1000 loops, best of 3: 387 µs per loop
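To see the boxing described above in action, a tiny sketch:

import numpy as np

arr = np.arange(1000)
item = arr[0]      # pulls the element out of the contiguous C buffer
print(type(item))  # <class 'numpy.int64'> -- a full Python object, not a bare C int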
There might be exceptions to this rule, but they will be really sparse, at least as long as there is some equivalent numpy functionality. So if you would otherwise iterate over single elements, use numpy instead.
Sometimes a plain Python loop is enough. It's not widely advertised, but numpy functions have a huge overhead compared to Python functions. For example, consider a 3-element array:
arr = np.arange(3)
%timeit np.sum(arr)
%timeit sum(arr)
Which one will be faster?
Solution: The Python function performs better than the numpy solution:
# 10000 loops, best of 3: 21.9 µs per loop <- numpy
# 100000 loops, best of 3: 6.27 µs per loop <- python
But what does this have to do with your example? Not all that much, in fact, because you always use numpy functions on arrays (not on single elements, and not even on a few elements), so your inner loop already uses the optimized functions. That's why both perform roughly the same (+/- a factor of 10 with very few elements, down to a factor of 2 at roughly 500 elements). But that's not really loop overhead; it's function call overhead!
Using line_profiler and resolution = 100:
def fun_func(tim, prec, values):
    for i, ti in enumerate(tim):
        values[i] = np.sum(np.sin(prec * ti))
%lprun -f fun_func fun_func(tim, prec, values)
Line # Hits Time Per Hit % Time Line Contents
==============================================================
1 def fun_func(tim, prec, values):
2 101 752 7.4 5.7 for i, ti in enumerate(tim):
3 100 12449 124.5 94.3 values[i] = np.sum(np.sin(prec * ti))
95% of the time is spent inside the loop body; I've even split the loop body into several parts to verify this:
def fun_func(tim, prec, values):
    for i, ti in enumerate(tim):
        x = prec * ti
        x = np.sin(x)
        x = np.sum(x)
        values[i] = x
%lprun -f fun_func fun_func(tim, prec, values)
Line # Hits Time Per Hit % Time Line Contents
==============================================================
1 def fun_func(tim, prec, values):
2 101 609 6.0 3.5 for i, ti in enumerate(tim):
3 100 4521 45.2 26.3 x = prec * ti
4 100 4646 46.5 27.0 x = np.sin(x)
5 100 6731 67.3 39.1 x = np.sum(x)
6 100 714 7.1 4.1 values[i] = x
The time consumers here are np.multiply, np.sin and np.sum, as you can easily check by comparing their time per call with their overhead:
arr = np.ones(1, float)
%timeit np.sum(arr)
# 10000 loops, best of 3: 22.6 µs per loop
So as soon as the cumulative function call overhead is small compared to the calculation runtime, you'll get similar runtimes. Even with 100 items you're quite close to the overhead time. The trick is to know at which point they break even. With 1000 items the call overhead is still significant:
%lprun -f fun_func fun_func(tim, prec, values)
Line # Hits Time Per Hit % Time Line Contents
==============================================================
1 def fun_func(tim, prec, values):
2 1001 5864 5.9 2.4 for i, ti in enumerate(tim):
3 1000 42817 42.8 17.2 x = prec * ti
4 1000 119327 119.3 48.0 x = np.sin(x)
5 1000 73313 73.3 29.5 x = np.sum(x)
6 1000 7287 7.3 2.9 values[i] = x
But with resolution = 5000 the overhead is quite low compared to the runtime:
Line # Hits Time Per Hit % Time Line Contents
==============================================================
1 def fun_func(tim, prec, values):
2 5001 29412 5.9 0.9 for i, ti in enumerate(tim):
3 5000 388827 77.8 11.6 x = prec * ti
4 5000 2442460 488.5 73.2 x = np.sin(x)
5 5000 441337 88.3 13.2 x = np.sum(x)
6 5000 36187 7.2 1.1 values[i] = x
When you spend 500 µs in each np.sin call, you don't care about the 20 µs overhead anymore.
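If you want to locate the break-even point on your own machine, a sketch along these lines (using the timeit module instead of the %timeit magic; the resolutions are chosen arbitrarily) will do:

import timeit
import numpy as np

def loop_solution(tim, prec, values):
    for i, ti in enumerate(tim):
        values[i] = np.sum(np.sin(prec * ti))

def broadcast_solution(tim, prec):
    return np.sum(np.sin(tim[:, np.newaxis] * prec), axis=1)

for resolution in (10, 100, 1000, 5000):
    tim = np.linspace(-np.pi, np.pi, resolution)
    prec = 2 * np.arange(1, resolution + 1) - 1
    values = np.zeros_like(tim)
    t_loop = timeit.timeit(lambda: loop_solution(tim, prec, values), number=3)
    t_bc = timeit.timeit(lambda: broadcast_solution(tim, prec), number=3)
    print(f"resolution={resolution}: loop {t_loop:.4f}s, broadcast {t_bc:.4f}s")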
A word of caution is probably in order: line_profiler probably adds some overhead per line, and maybe also per function call, so the point at which the function call overhead becomes negligible might be lower!
I started by profiling the first solution; now let's do the same with the second solution:
def fun_func(tim, prec, values):
    x = tim[:, np.newaxis]
    x = x * prec
    x = np.sin(x)
    x = np.sum(x, axis=1)
    return x
Again using line_profiler with resolution = 100:
%lprun -f fun_func fun_func(tim, prec, values)
Line # Hits Time Per Hit % Time Line Contents
==============================================================
1 def fun_func(tim, prec, values):
2 1 27 27.0 0.5 x = tim[:, np.newaxis]
3 1 638 638.0 12.9 x = x * prec
4 1 3963 3963.0 79.9 x = np.sin(x)
5 1 326 326.0 6.6 x = np.sum(x, axis=1)
6 1 4 4.0 0.1 return x
This already exceeds the overhead time significantly, and thus we end up roughly a factor of 10 faster compared to the loop.
I also did the profiling for resolution = 1000:
Line # Hits Time Per Hit % Time Line Contents
==============================================================
1 def fun_func(tim, prec, values):
2 1 28 28.0 0.0 x = tim[:, np.newaxis]
3 1 17716 17716.0 14.6 x = x * prec
4 1 91174 91174.0 75.3 x = np.sin(x)
5 1 12140 12140.0 10.0 x = np.sum(x, axis=1)
6 1 10 10.0 0.0 return x
and with resolution = 5000:
Line # Hits Time Per Hit % Time Line Contents
==============================================================
1 def fun_func(tim, prec, values):
2 1 34 34.0 0.0 x = tim[:, np.newaxis]
3 1 333685 333685.0 11.1 x = x * prec
4 1 2391812 2391812.0 79.6 x = np.sin(x)
5 1 280832 280832.0 9.3 x = np.sum(x, axis=1)
6 1 14 14.0 0.0 return x
The 1000 size is still faster, but as we've seen there, the call overhead was still not negligible in the loop solution. But for resolution = 5000 the time spent in each step is almost identical (some steps are a bit slower, others faster, but overall quite similar).
Another effect is that the actual broadcasting in the multiplication becomes significant. Even with the very smart numpy solutions this still includes some additional calculations. For resolution = 10000 you can see that the broadcast multiplication starts to take up more "% time" relative to the loop solution:
Line # Hits Time Per Hit % Time Line Contents
==============================================================
1 def broadcast_solution(tim, prec, values):
2 1 37 37.0 0.0 x = tim[:, np.newaxis]
3 1 1783345 1783345.0 13.9 x = x * prec
4 1 9879333 9879333.0 77.1 x = np.sin(x)
5 1 1153789 1153789.0 9.0 x = np.sum(x, axis=1)
6 1 11 11.0 0.0 return x
Line # Hits Time Per Hit % Time Line Contents
==============================================================
8 def loop_solution(tim, prec, values):
9 10001 62502 6.2 0.5 for i, ti in enumerate(tim):
10 10000 1287698 128.8 10.5 x = prec * ti
11 10000 9758633 975.9 79.7 x = np.sin(x)
12 10000 1058995 105.9 8.6 x = np.sum(x)
13 10000 75760 7.6 0.6 values[i] = x
But there is another thing besides the actual time spent: memory consumption. Your loop solution requires O(n) memory because you only ever process n elements at a time. The broadcasting solution, however, requires O(n*n) memory. You'd probably have to wait a while with resolution = 20000 and your loop, but it would still only require 8 bytes/element * 20000 elements ~= 160 kB, whereas the broadcasting needs ~3 GB. And this neglects the constant factor (such as temporary, aka intermediate, arrays)! Go even further and you'll run out of memory very fast!
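Not part of the original solutions, but a common middle ground is to chunk the broadcast: the per-call overhead stays amortized while peak memory is capped at O(chunk * n) instead of O(n * n). A minimal sketch (the function name and chunk size are mine):

import numpy as np

def chunked_solution(tim, prec, chunk=256):
    # process `chunk` rows of the virtual (n, n) product at a time
    values = np.empty_like(tim)
    for start in range(0, tim.size, chunk):
        block = np.sin(tim[start:start + chunk, np.newaxis] * prec)
        values[start:start + chunk] = block.sum(axis=1)
    return values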
Time to summarize the points again:
- numpy function calls carry a large constant overhead, so for very few elements a plain Python loop or function can win.
- As soon as the per-call computation dominates that overhead, a loop of numpy calls and the broadcast solution perform similarly.
- Broadcasting buys its speed with O(n*n) memory, while the loop solution only ever needs O(n).
But the most important point about optimization is still:
Only optimize the code if it's too slow! If it's too slow then optimize only after profiling your code.
Don't blindly trust simplified statements, and again: never optimize without profiling.
One final thought: such functions, which require either a loop or broadcasting, can easily be implemented with cython, numba or numexpr if there is no existing solution in numpy or scipy. For example, a numba function that combines the memory efficiency of the loop solution with the speed of the broadcast solution at low resolutions would look like this:
from numba import njit
import math

@njit
def numba_solution(tim, prec, values):
    size = tim.size
    for i in range(size):
        ti = tim[i]
        x = 0.0  # float accumulator
        for j in range(size):
            x += math.sin(prec[j] * ti)
        values[i] = x
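A minimal usage check (assuming numba is installed; the first call includes JIT compilation time, so time a second call if you benchmark):

import numpy as np

resolution = 1000
tim = np.linspace(-np.pi, np.pi, resolution)
prec = 2 * np.arange(1, resolution + 1) - 1
values = np.zeros_like(tim)

numba_solution(tim, prec, values)
expected = np.sum(np.sin(tim[:, np.newaxis] * prec), axis=1)
print(np.allclose(values, expected))  # True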
As pointed out in the comments, numexpr can also evaluate the broadcast calculation very fast and without requiring O(n*n) memory:
>>> import numexpr
>>> tim_2d = tim[:, np.newaxis]
>>> numexpr.evaluate('sum(sin(tim_2d * prec), axis=1)')
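And a quick sanity check that the numexpr result matches the plain numpy broadcast (assumes the arrays defined above are still in scope):

>>> expected = np.sum(np.sin(tim_2d * prec), axis=1)
>>> np.allclose(numexpr.evaluate('sum(sin(tim_2d * prec), axis=1)'), expected)
True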