Python Numpy : np.int32 "slower" than np.float64

Tags:

python

floating-point

numpy

int32

I would like to understand a strange behavior of Python. Consider a matrix M with shape 6000 x 2000, filled with signed integers. I want to compute np.transpose(M)*M. Two options:

When I do it "naturally" (i.e. without specifying any typing), numpy selects the type np.int32 and the operation takes around 150s.
When I force the type to be np.float64 (using dtype=...), the same operation takes around 2s.

How can we explain this behavior? I naively thought that an int multiplication was cheaper than a float multiplication.

Thanks a lot for your help.

694

asked Sep 11 '13 14:09

ThR37

2 Answers

No, integer multiplies aren't cheaper, but more on that later. Most likely (I am 99% sure) numpy calls a BLAS routine under the hood, which can reach as much as 90% of peak CPU performance. There are no special provisions for int matrix multiplies; most likely they are done in Python rather than in a machine-compiled version. (I am actually wrong on this, see below.)

With regard to int vs float speed: on most architectures (Intel) they are roughly the same, around 3-5 cycles or so per instruction, and both have scalar and vector (XMM) versions. On Sandy Bridge, PMUL*** (integer vector multiply) is 5 cycles and so are the MULP* (floating-point multiplies). With Sandy Bridge you also have 256-bit SIMD vector ops (YMM), which give you 8 float ops per instruction; I am not sure if there is an int counterpart.

This here is a great reference: http://www.agner.org/optimize/instruction_tables.pdf

That said, instruction latencies don't explain a 75x speed difference. It is probably a combination of an optimized (probably threaded) BLAS and int32 being handled in Python rather than C/Fortran.

I profiled the following snippet:
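To put the 75x figure in perspective, here is a back-of-the-envelope calculation. It uses only the shape and the two timings reported in the question; the variable names are mine, and counting one multiply plus one add per inner-product term is the usual convention rather than a measurement.

```python
# Shape and timings come from the question; the rest is plain arithmetic.
rows, cols = 6000, 2000            # shape of M
n_out = cols * cols                # number of entries in M^T M (2000 x 2000)
ops = 2 * rows * n_out             # one multiply and one add per inner-product term

t_int32, t_float64 = 150.0, 2.0    # seconds, as reported in the question

print(f"operations:         {ops:.2e}")
print(f"int32 throughput:   {ops / t_int32 / 1e9:.2f} Gop/s")
print(f"float64 throughput: {ops / t_float64 / 1e9:.2f} Gop/s")
print(f"speed ratio:        {t_int32 / t_float64:.0f}x")
```

The integer path works out to well under one operation per nanosecond, which is hard to attribute to multiply latency alone and fits better with a difference in how the two products are computed.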

>>> import numpy as np
>>> F = (np.random.random((6000,2000))+4)
>>> I = F.astype(np.int32)
>>> np.dot(F, F.transpose()); np.dot(I, I.transpose())

and this is what oprofile says:

CPU_CLK_UNHALT...|
  samples|      %|
------------------
  2076880 51.5705 multiarray.so
  1928787 47.8933 libblas.so.3.0

However, the libblas used here is the unoptimized, serial Netlib BLAS. With a good BLAS implementation that 47% will be much lower, especially if it is threaded.

Edit: It seems numpy does provide a compiled version of integer matrix multiply.
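If you want to reproduce the comparison on your own machine, something like the following should do. It is only a sketch: the matrix is much smaller than the 6000 x 2000 one in the question so that the integer path finishes quickly, `time_dot` is a helper name I made up, and the timings you see will depend entirely on which BLAS your numpy is linked against.

```python
import time
import numpy as np

rng = np.random.default_rng(0)
# Deliberately small; values and shape are illustrative, not from the question.
M_int = rng.integers(-500, 500, size=(600, 200), dtype=np.int32)
M_flt = M_int.astype(np.float64)

def time_dot(a):
    """Return (a.T @ a, elapsed seconds) for a single run."""
    start = time.perf_counter()
    out = np.dot(a.T, a)
    return out, time.perf_counter() - start

G_int, t_int = time_dot(M_int)
G_flt, t_flt = time_dot(M_flt)

print(f"int32:   {t_int * 1e3:.2f} ms, result dtype {G_int.dtype}")
print(f"float64: {t_flt * 1e3:.2f} ms, result dtype {G_flt.dtype}")
print("same values:", np.array_equal(G_int, G_flt))
```

A single run like this is noisy; for real measurements use `timeit` and the full-size matrix.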

184

answered Oct 28 '22 01:10

Anycorn

Some supplemental information that I found through experimentation.

This can be circumvented. Timings are on an Intel CPU with Intel MKL for BLAS. I'm also using Fortran-ordered arrays to keep everything equivalent, as scipy.linalg.blas is the Fortran BLAS.

Let's take the following example:

import numpy as np
from scipy.linalg.blas import sgemm
from scipy.linalg.blas import dgemm

arr_int64 = np.random.randint(-500,500,(6000,2000))

arr_int32 = arr_int64.astype(np.int32)

arr_float64 = arr_int64.astype(np.float64)+np.random.rand(6000,2000)

arr_float32 = arr_int64.astype(np.float32)
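The setup above does not show the Fortran-ordering step mentioned earlier. As a rough illustration of what that conversion does, here is a sketch that assumes `np.asfortranarray` is used; the answer does not say how the arrays were actually converted, and the tiny 3 x 4 array is only for demonstration.

```python
import numpy as np

a = np.arange(12, dtype=np.float64).reshape(3, 4)  # C-ordered (row-major) by default
a_f = np.asfortranarray(a)                         # same values, column-major layout

print("a   C/F contiguous:", a.flags['C_CONTIGUOUS'], a.flags['F_CONTIGUOUS'])
print("a_f C/F contiguous:", a_f.flags['C_CONTIGUOUS'], a_f.flags['F_CONTIGUOUS'])
print("values unchanged:  ", np.array_equal(a, a_f))

# Transposing only swaps strides, so the transpose of an F-ordered
# array is C-ordered and no data is copied.
print("a_f.T C contiguous:", a_f.T.flags['C_CONTIGUOUS'])
```

Only the memory layout changes, not the values, so results should be identical either way; the layout matters because a Fortran BLAS expects column-major input and a wrapper may otherwise have to copy.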

First, let's take the np.dot calls (for floating-point inputs these go through DGEMM/SGEMM).

%timeit np.dot(arr_float64.T,arr_float64) #400% CPU threaded BLAS
1 loops, best of 3: 969 ms per loop

%timeit np.dot(arr_float32.T,arr_float32) #400% CPU threaded BLAS
1 loops, best of 3: 513 ms per loop

%timeit np.dot(arr_int64.T,arr_int64)     #100% CPU?
1 loops, best of 3: 24.7 s per loop

%timeit np.dot(arr_int32.T,arr_int32)     #100% CPU?
1 loops, best of 3: 21.3 s per loop

Calling DGEMM/SGEMM directly:

%timeit dgemm(alpha=1, a=arr_float64, b=arr_float64, trans_a=True)
1 loops, best of 3: 1.13 s per loop

%timeit dgemm(alpha=1, a=arr_int64, b=arr_int64, trans_a=True)
1 loops, best of 3: 869 ms per loop

%timeit sgemm(alpha=1, a=arr_float32, b=arr_float32, trans_a=True)
1 loops, best of 3: 657 ms per loop

%timeit sgemm(alpha=1, a=arr_int32, b=arr_int32, trans_a=True)
1 loops, best of 3: 432 ms per loop

np.allclose( np.dot(arr_int32.T,arr_int32), sgemm(alpha=1, a=arr_int32, b=arr_int32, trans_a=True))
#True
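Note that np.allclose only checks agreement within a tolerance. float32 has a 24-bit significand, so with sums this large the SGEMM result may not be bit-exact. If exact integer results matter, one option is to go through float64 and check beforehand that nothing can exceed 2**53. The helper below is a sketch of that idea; `gram_via_float64` is a name I made up, not a numpy or scipy function, and the bound it uses is deliberately pessimistic.

```python
import numpy as np

def gram_via_float64(m):
    """Compute m.T @ m for an integer array by routing through float64.

    Hypothetical helper, not part of NumPy. It refuses to run when the
    worst-case magnitude of an output entry could exceed 2**53, beyond
    which float64 no longer represents every integer exactly.
    """
    m = np.asarray(m)
    if not np.issubdtype(m.dtype, np.integer):
        raise TypeError("expected an integer array")
    max_abs = int(np.abs(m.astype(np.int64)).max()) if m.size else 0
    worst_case = m.shape[0] * max_abs * max_abs   # bound on |sum of products|
    if worst_case > 2**53:
        raise OverflowError("result may not be exact in float64")
    m64 = m.astype(np.float64)
    return np.dot(m64.T, m64).astype(np.int64)

rng = np.random.default_rng(0)
arr = rng.integers(-500, 500, size=(300, 40), dtype=np.int32)

fast = gram_via_float64(arr)
reference = np.dot(arr.T.astype(np.int64), arr.astype(np.int64))
print("exact match:", np.array_equal(fast, reference))
```

For the 6000 x 2000 example with values in [-500, 500), the bound is 6000 * 500 * 500 = 1.5e9, far below 2**53, so the float64 route would be exact there.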

It looks like something strange is going on in the np.dot call for integer inputs. The speed is similar to that of a naive algorithm:

%timeit np.einsum('ij,jk',arr_int32.T,arr_int32)
1 loops, best of 3: 14.1 s per loop

%timeit np.einsum('ij,jk',arr_int64.T,arr_int64)
1 loops, best of 3: 26 s per loop

answered Oct 28 '22 02:10

Daniel
