I would like to understand a strange behavior of python.
Let us consider a matrix Mwith shape 6000 x 2000. This matrix is filled with signed integers. I want to compute np.transpose(M)*M. Two options:
np.int32 and the operation takes around 150s.np.float64 (using dtype=...), the same operation takes around 2s.How can we explain this behavior ? I was naively thinking that a int multiplication was cheaper than a float multiplication.
Thanks a lot for your help.
float64 is much slower than Python's float, and numpy. float32 is even slower (even though I'm on a 32-bit machine).
Because the Numpy array is densely packed in memory due to its homogeneous type, it also frees the memory faster. So overall a task executed in Numpy is around 5 to 100 times faster than the standard python list, which is a significant leap in terms of speed.
Python's floating-point numbers are usually 64-bit floating-point numbers, nearly equivalent to np.
float64 are numpy specific 32 and 64-bit float types. Thus, when you do isinstance(2.0, np. float) , it is equivalent to isinstance(2.0, float) as 2.0 is a plain python built-in float type... and not the numpy type. isinstance(np.
No, integer multiplies aren't cheaper.  But more on that later.
Most likely (I am 99% sure) numpy calls BLAS routine under blankets, which can be as efficient as 90% of peak CPU performance.  There aren't special provisions for int matrix multiplies, most likely it is done in Python rather than machine-compiled version - I am actually wrong on this, see below.
With regards to int vs float speed:  on most architectures (Intel) they are roughly the same, around 3-5 cycles or so per instruction, both have serial (X87) and vector (XMM) version.  On Sandy bridge, PMUL*** (integer vector multiply) is 5 cycles and so are the MULP* (floating multiplies).   With Sandy Bridge you also have 256-bit SIMD vectors ops (YMM) - you get 8 float ops per instructions - I am not sure if there is an int counterpart.
This here is a great reference: http://www.agner.org/optimize/instruction_tables.pdf
That said, instruction latencies don't explain 75X speed difference. It is probably a combination of optimized BLAS (threaded probably) and int32 being handled in Python rather than C/Fortran.
I profiled following snippet:
>>> F = (np.random.random((6000,2000))+4)
>>> I = F.astype(np.int32)
>>> np.dot(F, F.transpose()); np.dot(I, I.transpose())
and this is what oprofile says:
CPU_CLK_UNHALT...|
  samples|      %|
------------------
  2076880 51.5705 multiarray.so
  1928787 47.8933 libblas.so.3.0
However the libblas is unoptimized serial Netlib Blas. With a good BLAS implementation that 47% will be much lower, especially if it is threaded.
Edit: It seems numpy does provide compiled version of integer matrix multiply.
Some supplemental information that I found through experimentation.
This can be circumvented. Timings are on a intel cpu with intel mkl for BLAS. Im also using fortran ordered arrays to keep everything equivalent a the scipy.linalg.blas is the fortran BLAS.
Lets take the following example:
from scipy.linalg.blas import sgemm
from scipy.linalg.blas import dgemm
arr_int64 = np.random.randint(-500,500,(6000,2000))
arr_int32 = array_int64.astype(np.int32)
arr_float64 = array_int64.astype(np.float64)+np.random.rand(6000,2000)
arr_float32 = array_int64.astype(np.float32)
First lets take the DGEMM calls.
%timeit np.dot(arr_float64.T,arr_float64) #400% CPU threaded BLAS
1 loops, best of 3: 969 ms per loop
%timeit np.dot(arr_float32.T,arr_float32) #400% CPU threaded BLAS
1 loops, best of 3: 513 ms per loop
%timeit np.dot(arr_int64.T,arr_int64)     #100% CPU?
1 loops, best of 3: 24.7 s per loop
%timeit np.dot(arr_int32.T,arr_int32)     #100% CPU?
1 loops, best of 3: 21.3 s per loop
Calling DGEMM/SGEMM directly:
%timeit dgemm(alpha=1, a=arr_float64, b=arr_float64, trans_a=True)
1 loops, best of 3: 1.13 s per loop
%timeit dgemm(alpha=1, a=arr_int64, b=arr_int64, trans_a=True)
1 loops, best of 3: 869 ms per loop
%timeit sgemm(alpha=1, a=arr_float32, b=arr_float32, trans_a=True)
1 loops, best of 3: 657 ms per loop
%timeit sgemm(alpha=1, a=arr_int32, b=arr_int32, trans_a=True)
1 loops, best of 3: 432 ms per loop
np.allclose( np.dot(arr_int32.T,arr_int32), sgemm(alpha=1, a=arr_int32, b=arr_int32, trans_a=True))
#True
Looks like something strange going on in the np.dot call. Similar to naive algorithm speed:
%timeit np.einsum('ij,jk',arr_int32.T,arr_int32)
1 loops, best of 3: 14.1 s per loop
%timeit np.einsum('ij,jk',arr_int64.T,arr_int64)
1 loops, best of 3: 26 s per loop
                        If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With