I wanted to test the performance of vectorizing code in Python:
import timeit
import numpy as np
def func1():
    x = np.arange(1000)
    sum = np.sum(x*2)
    return sum

def func2():
    sum = 0
    for i in xrange(1000):
        sum += i*2
    return sum

def func3():
    sum = 0
    for i in xrange(0,1000,4):
        x = np.arange(i,i+4,1)
        sum += np.sum(x*2)
    return sum

print timeit.timeit(func1, number = 1000)
print timeit.timeit(func2, number = 1000)
print timeit.timeit(func3, number = 1000)
The code gives the following output:
0.0105729103088
0.069864988327
0.983253955841
The performance difference between the first and second functions is not surprising. But I was surprised that the 3rd function is significantly slower than the other two.
I am much more familiar with vectorising code in C than in Python, and the 3rd function is more C-like: running a for loop and processing 4 numbers in one instruction per iteration. To my understanding, numpy calls a C function which then vectorizes the computation in C. If that is the case, my code is also passing 4 numbers at a time to numpy, so it shouldn't need larger batches to perform well. So why is it so much slower? Is it because of the overhead of calling a numpy function?
Besides, the reason I came up with the 3rd function in the first place is that I'm worried about the cost of the large memory allocation to x in func1.
Is my worry valid? Why or why not, and how can I improve it?
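(One way to quantify this worry is to time the allocation separately from the arithmetic. A rough sketch, written for Python 3; the absolute numbers are machine-dependent:)

```python
import timeit

import numpy as np

# Time just the allocation, just the multiply+sum on a pre-built array,
# and the whole thing, to see how much of func1's cost is np.arange.
t_alloc = timeit.timeit(lambda: np.arange(1000), number=1000)
x = np.arange(1000)
t_work = timeit.timeit(lambda: np.sum(x * 2), number=1000)
t_all = timeit.timeit(lambda: np.sum(np.arange(1000) * 2), number=1000)
print(t_alloc, t_work, t_all)
```

On a typical machine the allocation is a small fraction of the total, which suggests the multiply and sum dominate, not the memory allocation.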
Thanks in advance.
Edit:
For curiosity's sake, although it defeats my original purpose for creating the 3rd version, I looked into roganjosh's suggestion and tried the following edit.
def func3():
    sum = 0
    x = np.arange(0,1000)
    for i in xrange(0,1000,4):
        sum += np.sum(x[i:i+4]*2)
    return sum
The output:
0.0104308128357
0.0630609989166
0.748773813248
There is an improvement, but still a large gap compared with the other functions.
Is it because x[i:i+4] still creates a new array?
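(For what it's worth, a basic slice like x[i:i+4] is a view into the same buffer, not a copy; it's the multiplication x[i:i+4]*2 that allocates a new temporary array. A quick check, assuming Python 3:)

```python
import numpy as np

x = np.arange(1000)
view = x[0:4]        # basic slicing returns a view, no data copy
print(np.shares_memory(x, view))   # True
temp = view * 2      # the multiply allocates a fresh temporary array
print(np.shares_memory(x, temp))   # False
```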
Edit 2:
I've modified the code again according to Daniel's suggestion.
def func1():
    x = np.arange(1000)
    x *= 2
    return x.sum()

def func3():
    sum = 0
    x = np.arange(0,1000)
    for i in xrange(0,1000,4):
        x[i:i+4] *= 2
        sum += x[i:i+4].sum()
    return sum
The output:
0.00824999809265
0.0660569667816
0.598328828812
There is another speedup. So the allocation of numpy arrays is definitely part of the problem. Now in func3 there should be only one array allocation, yet the time is still way slower. Is it because of the overhead of calling numpy functions?
It seems you're mostly interested in the difference between your function 3 compared to the pure NumPy (function 1) and Python (function 2) approaches. The answer is quite simple (especially if you look at function 4): NumPy call overhead.
You typically need several thousand elements to get into the regime where the runtime of np.sum actually depends on the number of elements in the array. Using IPython and matplotlib (the plots are at the end of the answer) you can easily check the runtime dependency:
import numpy as np

n = []
timing_sum1 = []
timing_sum2 = []
for i in range(1, 25):
    num = 2**i
    arr = np.arange(num)
    print(num)
    time1 = %timeit -o arr.sum()   # calling the method
    time2 = %timeit -o np.sum(arr) # calling the function
    n.append(num)
    timing_sum1.append(time1)
    timing_sum2.append(time2)
The results for np.sum (shortened) are quite interesting:
4
22.6 µs ± 297 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
16
25.1 µs ± 1.08 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
64
25.3 µs ± 1.58 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
256
24.1 µs ± 1.48 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
1024
24.6 µs ± 221 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
4096
27.6 µs ± 147 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
16384
40.6 µs ± 1.29 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
65536
91.2 µs ± 1.03 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
262144
394 µs ± 8.09 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1048576
1.24 ms ± 4.38 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
4194304
4.71 ms ± 22.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
16777216
18.6 ms ± 280 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
It seems the constant factor is roughly 20µs (on my computer) and it takes an array with 16384 elements to double that time. So the timings for functions 3 and 4 are mostly multiples of that constant factor.
In function 3 you pay the constant factor twice per iteration, once with np.sum and once with np.arange. In this case arange is quite cheap because each array is the same size, so NumPy & Python & your OS probably reuse the memory of the array of the last iteration. However, even that takes time (roughly 2µs for very small arrays on my computer).
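This constant factor can also be seen with plain timeit, without IPython; a minimal sketch (the array sizes are illustrative):

```python
import timeit

import numpy as np

small = np.arange(4)
large = np.arange(1_000_000)

reps = 1000
t_small = timeit.timeit(lambda: np.sum(small), number=reps) / reps
t_large = timeit.timeit(lambda: np.sum(large), number=reps) / reps

# Per element, the large array is far cheaper: the per-call overhead
# is amortized over a million elements instead of four.
print(t_small / 4, t_large / 1_000_000)
```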
More generally: To identify bottlenecks you should always profile the functions!
I'll show the results for the functions using line-profiler. To that end I altered the functions a bit so they only do one operation per line:
import numpy as np

def func1():
    x = np.arange(1000)
    x = x*2
    return np.sum(x)

def func2():
    sum_ = 0
    for i in range(1000):
        tmp = i*2
        sum_ += tmp
    return sum_

def func3():
    sum_ = 0
    for i in range(0, 1000, 4):  # I'm using python3, so "range" is like "xrange"!
        x = np.arange(i, i + 4, 1)
        x = x * 2
        tmp = np.sum(x)
        sum_ += tmp
    return sum_

def func4():
    sum_ = 0
    x = np.arange(1000)
    for i in range(0, 1000, 4):
        y = x[i:i + 4]
        y = y * 2
        tmp = np.sum(y)
        sum_ += tmp
    return sum_
Results:
%load_ext line_profiler
%lprun -f func1 func1()
Line # Hits Time Per Hit % Time Line Contents
==============================================================
4 def func1():
5 1 62 62.0 23.8 x = np.arange(1000)
6 1 65 65.0 24.9 x = x*2
7 1 134 134.0 51.3 return np.sum(x)
%lprun -f func2 func2()
Line # Hits Time Per Hit % Time Line Contents
==============================================================
9 def func2():
10 1 7 7.0 0.1 sum_ = 0
11 1001 2523 2.5 30.9 for i in range(1000):
12 1000 2819 2.8 34.5 tmp = i*2
13 1000 2819 2.8 34.5 sum_ += tmp
14 1 3 3.0 0.0 return sum_
%lprun -f func3 func3()
Line # Hits Time Per Hit % Time Line Contents
==============================================================
16 def func3():
17 1 7 7.0 0.0 sum_ = 0
18 251 909 3.6 2.9 for i in range(0, 1000, 4):
19 250 6527 26.1 21.2 x = np.arange(i, i + 4, 1)
20 250 5615 22.5 18.2 x = x * 2
21 250 16053 64.2 52.1 tmp = np.sum(x)
22 250 1720 6.9 5.6 sum_ += tmp
23 1 3 3.0 0.0 return sum_
%lprun -f func4 func4()
Line # Hits Time Per Hit % Time Line Contents
==============================================================
25 def func4():
26 1 7 7.0 0.0 sum_ = 0
27 1 49 49.0 0.2 x = np.arange(1000)
28 251 892 3.6 3.4 for i in range(0, 1000, 4):
29 250 2177 8.7 8.3 y = x[i:i + 4]
30 250 5431 21.7 20.7 y = y * 2
31 250 15990 64.0 60.9 tmp = np.sum(y)
32 250 1686 6.7 6.4 sum_ += tmp
33 1 3 3.0 0.0 return sum_
I won't go into the details of the results, but as you can see np.sum is definitely the bottleneck in func3 and func4. I already guessed that np.sum was the bottleneck before I wrote the answer, but these line-profilings actually verify it.
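If you want to keep the process-4-at-a-time structure while paying the np.sum overhead only once, you can let NumPy do the chunking itself. A sketch (func5 is my name, not from the question; the reshape requires the length to be divisible by the chunk size):

```python
import numpy as np

def func5():
    # One allocation, one multiply, one sum over axis=1, one final sum:
    # the 250 Python-level np.sum calls collapse into a single vectorized call.
    x = np.arange(1000).reshape(-1, 4)   # 250 chunks of 4 elements
    return (x * 2).sum(axis=1).sum()

print(func5())   # same result as func1/func2/func3: 999000
```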
Which leads to a very important fact when using NumPy: each call to a NumPy function carries a roughly constant overhead, so vectorized operations only pay off when each call processes enough elements.
If you really believe some part is too slow then you can profile it and move the hot spot to compiled code (for example with Cython or numba).
But generally you probably can't beat NumPy for moderately sized (several thousand entries and more) arrays.
%matplotlib notebook
import matplotlib.pyplot as plt
# Average time per sum-call
fig = plt.figure(1)
ax = plt.subplot(111)
ax.plot(n, [time.average for time in timing_sum1], label='arr.sum()', c='red')
ax.plot(n, [time.average for time in timing_sum2], label='np.sum(arr)', c='blue')
ax.set_xscale('log')
ax.set_yscale('log')
ax.set_xlabel('elements')
ax.set_ylabel('time it takes to sum them [seconds]')
ax.grid(which='both')
ax.legend()
# Average time per element
fig = plt.figure(2)  # a new figure, so the second plot doesn't draw over the first
ax = plt.subplot(111)
ax.plot(n, [time.average / num for num, time in zip(n, timing_sum1)], label='arr.sum()', c='red')
ax.plot(n, [time.average / num for num, time in zip(n, timing_sum2)], label='np.sum(arr)', c='blue')
ax.set_xscale('log')
ax.set_yscale('log')
ax.set_xlabel('elements')
ax.set_ylabel('time per element [seconds / element]')
ax.grid(which='both')
ax.legend()
The plots are log-log; I think that's the best way to visualize the data given that it spans several orders of magnitude (I just hope it's still understandable).
The first plot shows how much time it takes to do the sum:
The second plot shows the average time it takes to do the sum, divided by the number of elements in the array. This is just another way to interpret the data: