I've got a numpy script that spends about 50% of its runtime in the following code:

s = numpy.dot(v1, v1)

where

v1 = v[1:]

and v is a 4000-element 1D ndarray of float64 stored in contiguous memory (v.strides is (8,)).

Any suggestions for speeding this up?
Edit: This is on Intel hardware. Here is the output of my numpy.show_config():
atlas_threads_info:
libraries = ['lapack', 'ptf77blas', 'ptcblas', 'atlas']
library_dirs = ['/usr/local/atlas-3.9.16/lib']
language = f77
include_dirs = ['/usr/local/atlas-3.9.16/include']
blas_opt_info:
libraries = ['ptf77blas', 'ptcblas', 'atlas']
library_dirs = ['/usr/local/atlas-3.9.16/lib']
define_macros = [('ATLAS_INFO', '"\\"3.9.16\\""')]
language = c
include_dirs = ['/usr/local/atlas-3.9.16/include']
atlas_blas_threads_info:
libraries = ['ptf77blas', 'ptcblas', 'atlas']
library_dirs = ['/usr/local/atlas-3.9.16/lib']
language = c
include_dirs = ['/usr/local/atlas-3.9.16/include']
lapack_opt_info:
libraries = ['lapack', 'ptf77blas', 'ptcblas', 'atlas']
library_dirs = ['/usr/local/atlas-3.9.16/lib']
define_macros = [('ATLAS_INFO', '"\\"3.9.16\\""')]
language = f77
include_dirs = ['/usr/local/atlas-3.9.16/include']
lapack_mkl_info:
NOT AVAILABLE
blas_mkl_info:
NOT AVAILABLE
mkl_info:
NOT AVAILABLE
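For reference, a minimal timing sketch of the call in question (the random data and loop count below are illustrative, not from my actual script):

import numpy, timeit

# 4000-element contiguous float64 vector and a view that drops the first element.
v = numpy.random.rand(4000)
v1 = v[1:]

t = timeit.timeit(lambda: numpy.dot(v1, v1), number=100000)
print("per call: %.2f us" % (t / 100000 * 1e6))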
One option is Cython: by explicitly declaring the ndarray data type and the types of the variables involved, Cython can give drastic speed increases at runtime; speedups as large as 1250x have been reported for array processing.
Then again, NumPy and/or SciPy may already get your Python code running about as fast as native C++ (and the C++ version can take longer to develop).
Also note that matmul differs from dot in two important ways: multiplication by scalars is not allowed, and stacks of matrices are broadcast together as if the matrices were elements. For example:
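A quick sketch of both differences (the shapes here are chosen only for illustration):

import numpy as np

a = np.ones((2, 3, 4))
b = np.ones((2, 4, 5))

# matmul broadcasts stacks of matrices: result shape (2, 3, 5)
print(np.matmul(a, b).shape)

# dot contracts the last axis of a with the second-to-last axis of b,
# giving shape (2, 3, 2, 5) instead.
print(np.dot(a, b).shape)

# Scalars are fine with dot but rejected by matmul.
print(np.dot(3.0, np.ones(4)))      # works
# np.matmul(3.0, np.ones(4))        # raises ValueError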
Your arrays are not very big, so ATLAS probably isn't doing much. What are your timings for the following Fortran program? Assuming ATLAS isn't doing much, this should give you a sense of how fast dot() could be if there were no Python overhead. With gfortran -O3 I get 5 +/- 0.5 us per call.
program test
    real*8 :: x(4000), start, finish, s
    integer :: j
    integer, parameter :: jmax = 100000
    x(:) = 4.65
    s = 0.
    call cpu_time(start)
    do j = 1, jmax
        s = s + dot_product(x, x)
    enddo
    call cpu_time(finish)
    print *, (finish-start)/jmax * 1.e6, s
end program test
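For a rough Python-side counterpart of the same loop (assuming a 4000-element float64 vector as in the question; absolute numbers will depend on your BLAS build and hardware):

import numpy, timeit

# Same data and loop count as the Fortran version above.
x = numpy.full(4000, 4.65)
jmax = 100000
t = timeit.timeit(lambda: numpy.dot(x, x), number=jmax)
print("%.2f us per call" % (t / jmax * 1e6))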
Perhaps the culprit is copying of the arrays passed to dot.
As Sven said, the dot product relies on BLAS operations, which require arrays stored in contiguous C order. If both arrays passed to dot are C_CONTIGUOUS, you ought to see better performance.
Of course, if the two arrays passed to dot are truly 1D, e.g. shape (8,), then you should see both the C_CONTIGUOUS and F_CONTIGUOUS flags set to True; but if they are 2D, e.g. (1, 8), then you can see mixed order:
>>> w = NP.random.randint(0, 10, 100).reshape(100, 1)
>>> w.flags
C_CONTIGUOUS : True
F_CONTIGUOUS : False
OWNDATA : False
WRITEABLE : True
ALIGNED : True
UPDATEIFCOPY : False
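If an operand is not C-contiguous, you can force a C-ordered copy before the call; a minimal sketch (the 500x500 arrays here are placeholders, and ascontiguousarray copies only when it has to):

import numpy as np

A = np.asfortranarray(np.random.rand(500, 500))   # Fortran-ordered input
B = np.random.rand(500, 500)

# Make sure both operands are C-contiguous before calling dot.
A_c = np.ascontiguousarray(A)
B_c = np.ascontiguousarray(B)
print(A_c.flags['C_CONTIGUOUS'], B_c.flags['C_CONTIGUOUS'])   # True True
result = np.dot(A_c, B_c)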
An alternative: use dgemm from BLAS, which is exposed through the module scipy.linalg.fblas. (The two arrays, A and B, are in Fortran order here, since fblas is used.)
from scipy.linalg import fblas as FB
X = FB.dgemm(alpha=1., a=A, b=B, trans_b=True)
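Note that in newer SciPy versions scipy.linalg.fblas has been removed and the BLAS wrappers live in scipy.linalg.blas instead; for a plain vector dot product, the level-1 routine ddot is the relevant one. A sketch, assuming a recent SciPy (whether it beats numpy.dot depends on how much call overhead dominates):

import numpy as np
from scipy.linalg import blas   # BLAS wrappers in newer SciPy

v = np.random.rand(4000)
v1 = v[1:]

# ddot is the double-precision BLAS level-1 dot product.
s = blas.ddot(v1, v1)
print(s, np.dot(v1, v1))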