I am writing a time-consuming program. To reduce the run time, I have tried my best to use numpy.dot instead of for loops.
However, I found the vectorized program to have much worse performance than the for loop version:
import numpy as np
import datetime
kpt_list = np.zeros((10000,20),dtype='float')
rpt_list = np.zeros((1000,20),dtype='float')
h_r = np.zeros((20,20,1000),dtype='complex')
r_ndegen = np.zeros(1000,dtype='float')
r_ndegen.fill(1)
# setup completed
# this is the vectorized version
r_ndegen_tile = np.tile(r_ndegen.reshape(1000, 1), 10000)
start = datetime.datetime.now()
phase = np.exp(1j * np.dot(rpt_list, kpt_list.T))/r_ndegen_tile
kpt_data_1 = h_r.dot(phase)
end = datetime.datetime.now()
print((end-start).total_seconds())
# the result is 19.302483
# this is the for loop version
kpt_data_2 = np.zeros((20, 20, 10000), dtype='complex')
start = datetime.datetime.now()
for i in range(10000):
    kpt = kpt_list[i, :]
    phase = np.exp(1j * np.dot(kpt, rpt_list.T))/r_ndegen
    kpt_data_2[:, :, i] = h_r.dot(phase)
end = datetime.datetime.now()
print((end-start).total_seconds())
# the result is 7.74583
What is happening here?
The first thing I suggest you do is break your script down into separate functions to make profiling and debugging easier:
def setup(n1=10000, n2=1000, n3=20, seed=None):
    # generate random test data, reproducible via the seed
    gen = np.random.RandomState(seed)
    kpt_list = gen.randn(n1, n3).astype(np.float64)
    rpt_list = gen.randn(n2, n3).astype(np.float64)
    h_r = (gen.randn(n3, n3, n2) + 1j * gen.randn(n3, n3, n2)).astype(np.complex128)
    r_ndegen = gen.randn(n2).astype(np.float64)
    return kpt_list, rpt_list, h_r, r_ndegen
def original_vec(*args, **kwargs):
    kpt_list, rpt_list, h_r, r_ndegen = setup(*args, **kwargs)
    r_ndegen_tile = np.tile(r_ndegen.reshape(1000, 1), 10000)
    phase = np.exp(1j * np.dot(rpt_list, kpt_list.T)) / r_ndegen_tile
    kpt_data = h_r.dot(phase)
    return kpt_data
def original_loop(*args, **kwargs):
    kpt_list, rpt_list, h_r, r_ndegen = setup(*args, **kwargs)
    kpt_data = np.zeros((20, 20, 10000), dtype='complex')
    for i in range(10000):
        kpt = kpt_list[i, :]
        phase = np.exp(1j * np.dot(kpt, rpt_list.T)) / r_ndegen
        kpt_data[:, :, i] = h_r.dot(phase)
    return kpt_data
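As a quick sanity check (something I'm adding here, not part of the original script), calling setup shows the shapes involved:
kpt_list, rpt_list, h_r, r_ndegen = setup(seed=0)
print(kpt_list.shape, rpt_list.shape, h_r.shape, r_ndegen.shape)
# (10000, 20) (1000, 20) (20, 20, 1000) (1000,)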
I would also highly recommend using random data rather than all-zero or all-one arrays, unless that's what your actual data looks like (!). This makes it much easier to check the correctness of your code - for example, if your last step is to multiply by a matrix of zeros then your output will always be all-zeros, regardless of whether or not there is a mistake earlier on in your code.
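For example (a toy illustration, not part of the original code), a mistake that swaps a transpose is invisible with all-zero data but obvious with random data:
a = np.zeros((4, 4))
print(np.allclose(a.dot(a.T), a.dot(a)))   # True - the mistake is hidden by the zeros
a = np.random.RandomState(0).randn(4, 4)
print(np.allclose(a.dot(a.T), a.dot(a)))   # False - random data exposes it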
Next, I would run these functions through line_profiler to see where they are spending most of their time. In particular, for original_vec:
In [1]: %lprun -f original_vec original_vec()
Timer unit: 1e-06 s
Total time: 23.7598 s
File: <ipython-input-24-c57463f84aad>
Function: original_vec at line 12
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    12                                           def original_vec(*args, **kwargs):
    13
    14         1        86498  86498.0      0.4      kpt_list, rpt_list, h_r, r_ndegen = setup(*args, **kwargs)
    15
    16         1        69700  69700.0      0.3      r_ndegen_tile = np.tile(r_ndegen.reshape(1000, 1), 10000)
    17         1      1331947 1331947.0      5.6      phase = np.exp(1j * np.dot(rpt_list, kpt_list.T)) / r_ndegen_tile
    18         1     22271637 22271637.0     93.7      kpt_data = h_r.dot(phase)
    19
    20         1            4      4.0      0.0      return kpt_data
You can see that it spends 93% of its time computing the dot product between h_r and phase. Here, h_r is a (20, 20, 1000) array and phase is (1000, 10000). We're computing a sum product over the last dimension of h_r and the first dimension of phase (you could write this in einsum notation as ijk,kl->ijl).
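As a quick illustration (a small sketch on made-up arrays, not the original data), the einsum expression reproduces what np.dot does here:
a = np.random.randn(2, 3, 4)
b = np.random.randn(4, 5)
# np.dot sums over the last axis of a and the first axis of the 2-D b
print(np.allclose(np.dot(a, b), np.einsum('ijk,kl->ijl', a, b)))   # True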
The first two dimensions of h_r don't really matter here - we could just as easily reshape h_r into a (20*20, 1000) array before taking the dot product. It turns out that this reshaping operation by itself gives a huge performance improvement:
In [2]: %timeit h_r.dot(phase)
1 loop, best of 3: 22.6 s per loop
In [3]: %timeit h_r.reshape(-1, 1000).dot(phase)
1 loop, best of 3: 1.04 s per loop
I'm not entirely sure why this should be the case - I would have hoped that numpy's dot function would be smart enough to apply this simple optimization automatically. On my laptop the second case seems to use multiple threads whereas the first one doesn't, suggesting that it might not be calling multithreaded BLAS routines.
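To double-check that the reshape doesn't change the answer (a quick verification I'm adding, reusing h_r and phase from above):
reshaped = h_r.reshape(-1, 1000).dot(phase).reshape(h_r.shape[:2] + (-1,))
print(np.allclose(h_r.dot(phase), reshaped))   # True (the slow 3D dot only has to run once here)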
Here's a vectorized version that incorporates the reshaping operation:
def new_vec(*args, **kwargs):
    kpt_list, rpt_list, h_r, r_ndegen = setup(*args, **kwargs)
    phase = np.exp(1j * np.dot(rpt_list, kpt_list.T)) / r_ndegen[:, None]
    kpt_data = h_r.reshape(-1, phase.shape[0]).dot(phase)
    return kpt_data.reshape(h_r.shape[:2] + (-1,))
The -1 indices tell numpy to infer the size of those dimensions according to the other dimensions and the number of elements in the array. I've also used broadcasting to divide by r_ndegen, which eliminates the need for np.tile.
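Here's a tiny sketch (shapes made up) confirming that the broadcast divide matches the tiled version:
x = np.random.randn(6, 8)
r = np.random.randn(6)
# dividing by r[:, None] broadcasts r down the rows without materializing a tiled copy
print(np.allclose(x / np.tile(r.reshape(6, 1), 8), x / r[:, None]))   # True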
By using the same random input data, we can check that the new version gives the same result as the original:
In [4]: ans1 = original_loop(seed=0)
In [5]: ans2 = new_vec(seed=0)
In [6]: np.allclose(ans1, ans2)
Out[6]: True
Some performance benchmarks:
In [7]: %timeit original_loop()
1 loop, best of 3: 13.5 s per loop
In [8]: %timeit original_vec()
1 loop, best of 3: 24.1 s per loop
In [5]: %timeit new_vec()
1 loop, best of 3: 2.49 s per loop
I was curious about why np.dot was so much slower for the original (20, 20, 1000) h_r array, so I dug into the numpy source code. The logic implemented in multiarraymodule.c turns out to be shockingly simple:
#if defined(HAVE_CBLAS)
    if (PyArray_NDIM(ap1) <= 2 && PyArray_NDIM(ap2) <= 2 &&
            (NPY_DOUBLE == typenum || NPY_CDOUBLE == typenum ||
             NPY_FLOAT == typenum || NPY_CFLOAT == typenum)) {
        return cblas_matrixproduct(typenum, ap1, ap2, out);
    }
#endif
In other words, numpy just checks whether either of the input arrays has more than 2 dimensions, and immediately falls back on a non-BLAS implementation of matrix-matrix multiplication. It seems like it shouldn't be too difficult to check whether the inner dimensions of the two arrays are compatible, and if so treat them as 2D and perform *gemm matrix-matrix multiplication on them. In fact there's an open feature request for this dating back to 2012, if any numpy devs are reading...
In the meantime, it's a nice performance trick to be aware of when multiplying tensors.
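As a sketch of what I mean (the helper name is my own, not a numpy function), the trick can be wrapped up like this for a 2-D right-hand operand:
def dot_last_first(a, b):
    # contract the last axis of an n-D `a` with the first axis of a 2-D `b`
    # via a single BLAS matrix-matrix product, then restore the leading shape
    return a.reshape(-1, a.shape[-1]).dot(b).reshape(a.shape[:-1] + b.shape[1:])

# e.g. dot_last_first(h_r, phase) gives the same result as h_r.dot(phase)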
I forgot about np.tensordot. Since it calls the same underlying BLAS routines as np.dot on a 2D array, it can achieve the same performance bump, but without all those ugly reshape operations:
In [6]: %timeit np.tensordot(h_r, phase, axes=1)
1 loop, best of 3: 1.05 s per loop
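For what it's worth, axes=1 tells tensordot to contract the last axis of h_r with the first axis of phase, so it should agree with the reshaped dot (a quick check I'm adding):
print(np.allclose(np.tensordot(h_r, phase, axes=1),
                  h_r.reshape(-1, 1000).dot(phase).reshape(20, 20, -1)))   # True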