No speedup when summing uint16 vs uint64 arrays with NumPy?

I have to do a large number of operations (additions) on relatively small integers, and I started considering which datatype would give the best performance on a 64 bit machine.

I was convinced that adding together 4 uint16 would take the same time as one uint64, since the ALU could make 4 uint16 additions using only 1 uint64 adder. (Carry propagation means this doesn't work that easily for a single 64-bit adder, but this is how integer SIMD instructions work.)

Apparently this is not the case:

In [3]: data = np.random.rand(10000)

In [4]: int16 = data.astype(np.uint16)

In [5]: int64 = data.astype(np.uint64)

In [6]: int32 = data.astype(np.uint32)

In [7]: float32 = data.astype(np.float32)

In [8]: float64 = data.astype(np.float64)

In [9]: %timeit int16.sum()
13.4 µs ± 43.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [10]: %timeit int32.sum()
13.9 µs ± 347 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [11]: %timeit int64.sum()
9.33 µs ± 47.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [12]: %timeit float32.sum()
5.79 µs ± 6.51 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [13]: %timeit float64.sum()
6 µs ± 3.54 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

All the integer operations take roughly the same time (and are slower than the float ones), with no speedup for the smaller types. Is this due to the C implementation of NumPy not being fully optimized, or is there some hardware limitation that I am not aware of?

asked Nov 27 '21 by pnjun



1 Answer

TL;DR: I made an experimental analysis on Numpy 1.21.1. Experimental results show that np.sum does NOT (really) make use of SIMD instructions: no SIMD instructions are used for integers, and only scalar SIMD instructions are used for floating-point numbers! Moreover, by default Numpy converts the smaller integer types to 64-bit values so as to avoid overflows!


Note that this may not reflect all Numpy versions, since there is ongoing work to provide SIMD support for commonly used functions (the not-yet-released version Numpy 1.22.0rc1 continues this long-standing work). Moreover, the compiler or the processor used may significantly impact the results. The following experiments have been done using a Numpy retrieved from pip on a Debian Linux with an i5-9600KF processor.


Under the hood of np.sum

For floating-point numbers, Numpy uses a pairwise algorithm which is known to be quite numerically stable while being relatively fast. This can be seen in the code, but also simply by using a profiler: TYPE_pairwise_sum is the C function called to compute the sum at runtime (where TYPE is DOUBLE or FLOAT).
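For reference, here is a minimal Python sketch of the pairwise idea. It only illustrates the principle: NumPy's actual C implementation works on unrolled blocks with several accumulators (as visible in the assembly below), and the function name pairwise_sum and the block size here are arbitrary choices.

import numpy as np

def pairwise_sum(x, block=128):
    # Sum small chunks directly; otherwise split in two halves and recurse.
    # Regrouping the additions this way keeps the rounding error growth close
    # to O(log n) instead of O(n) for a naive left-to-right loop.
    n = len(x)
    if n <= block:
        s = 0.0
        for v in x:
            s += v
        return s
    mid = n // 2
    return pairwise_sum(x[:mid], block) + pairwise_sum(x[mid:], block)

data = np.random.rand(10_000)
print(pairwise_sum(data), data.sum())  # both very close to the exact sum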

For integers, Numpy uses a classical naive reduction. The C function called on AVX2-compatible machines is ULONG_add_avx2. Somewhat surprisingly, it also converts the items to 64-bit ones if the type is not np.int64.
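This widening can be checked directly from Python by inspecting the dtype of the result (a quick sanity check; the output assumes a 64-bit Linux platform, where the default integer width is 64 bits):

import numpy as np

a = np.arange(10, dtype=np.uint16)
print(a.sum().dtype)  # uint64: the reduction is performed on 64-bit values by default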

Here is the hot part of the assembly code executed by the DOUBLE_pairwise_sum function:

  3,65 │ a0:┌─→add        $0x8,%rcx                  ; Loop iterator
  3,60 │    │  prefetcht0 (%r8,%rax,1)               ; Prefetch data
  9,46 │    │  vaddsd     (%rax,%rbp,1),%xmm1,%xmm1  ; Accumulate an item in xmm1[0]
  4,65 │    │  vaddsd     (%rax),%xmm0,%xmm0         ; Same for xmm0[0]
  6,91 │    │  vaddsd     (%rax,%rdx,1),%xmm4,%xmm4  ; Etc.
  7,77 │    │  vaddsd     (%rax,%rdx,2),%xmm7,%xmm7
  7,41 │    │  vaddsd     (%rax,%r10,1),%xmm2,%xmm2
  7,27 │    │  vaddsd     (%rax,%rdx,4),%xmm6,%xmm6
  6,80 │    │  vaddsd     (%rax,%r11,1),%xmm3,%xmm3
  7,46 │    │  vaddsd     (%rax,%rbx,1),%xmm5,%xmm5
  3,46 │    │  add        %r12,%rax                  ; Switch to the next items (x8)
  0,13 │    ├──cmp        %rcx,%r9                   ; Should the loop continue?
  3,27 │    └──jg         a0                         ; Jump to the beginning if so

The compiled Numpy code clearly uses the scalar SIMD instruction vaddsd (computing only a single double-precision item per instruction), although it successfully unrolls the loop 8 times. The same code is generated for FLOAT_pairwise_sum: vaddss is called 8 times.

For the np.uint32, here is the hot part of the generated assembly code:

  2,37 │160:┌─→add          $0x1,%rax     ; Loop iterator
 95,95 │    │  add          (%rdi),%rdx   ; Accumulate the values in %rdx
  0,06 │    │  add          %r10,%rdi     ; Switch to the next item
       │    ├──cmp          %rsi,%rax     ; Should the loop continue?
  1,08 │    └──jne          160           ; Jump to the beginning if so

Numpy obviously does not use SIMD instructions for the np.uint32 type. It does not even unroll the loop. The add (%rdi),%rdx instruction performing the accumulation is the bottleneck here, due to the data dependency on the accumulator. The same loop can be seen for np.uint64 (despite the function being named ULONG_add_avx2).

However, the np.uint32 version calls the C function _aligned_contig_cast_uint_to_ulong in order to convert the integer items to a wider type. Numpy does that to avoid integer overflows. The same thing can be seen for the types np.uint8 and np.uint16 (although the name of the function differs). Fortunately, this function makes use of SIMD instructions (SSE), but it still takes a significant portion of the execution time (~30% of the np.sum time).
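A rough way to gauge the cost of that widening from Python, without a native profiler, is to time an explicit astype on the same array. This is only an approximation: np.sum performs the cast through its own buffered inner loops rather than materializing a full 64-bit copy like this does.

import numpy as np
from timeit import timeit

a = (np.random.rand(10_000) * 100).astype(np.uint32)

t_cast = timeit(lambda: a.astype(np.uint64), number=10_000)  # explicit widening cast
t_sum = timeit(lambda: a.sum(), number=10_000)               # cast + reduction
print(f"astype(uint64): {t_cast * 100:.2f} µs per call, sum(): {t_sum * 100:.2f} µs per call")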

EDIT: as pointed out by @user2357112supportsMonica, the dtype parameter of np.sum can be explicitly specified. When it matches the dtype of the input array, the conversion is not performed. This results in a smaller execution time at the expense of a possible overflow.
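For example, a small sketch of the trade-off:

import numpy as np

a = np.full(10_000, 100, dtype=np.uint16)

print(a.sum())                 # 1000000: accumulated in 64 bits by default
print(a.sum(dtype=np.uint16))  # skips the widening cast, but wraps around modulo 2**16 (16960 here)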


Benchmark results

Here is the result on my machine:

uint16: 7.17 µs ± 80 ns per loop (mean ± std. dev. of 7 runs, 20000 loops each)
uint32: 7.11 µs ± 12.3 ns per loop (mean ± std. dev. of 7 runs, 20000 loops each)
uint64: 5.05 µs ± 8.57 ns per loop (mean ± std. dev. of 7 runs, 20000 loops each)
float32: 2.88 µs ± 9.27 ns per loop (mean ± std. dev. of 7 runs, 20000 loops each)
float64: 3.06 µs ± 10.6 ns per loop (mean ± std. dev. of 7 runs, 20000 loops each)

First of all, note that the results are very similar to the ones provided in the question, meaning that the behavior seen on my machine can be reproduced on other ones. Thus, the explanation should also be consistent.

As you can see, the 64-bit version is faster than the other integer-based versions. This is due to the overhead of the conversion. The first two are equally fast because the scalar loop and the add instruction are equally fast for 8-bit, 16-bit and 32-bit integers (this should be true on most mainstream 64-bit platforms). The integer implementations are slower than the floating-point ones because of the lack of (proper) loop unrolling.

The floating-point implementations are equally fast due to the scalar SIMD instructions. Indeed, the instructions vaddss (for np.float32) and vaddsd (for np.float64) have the same latency and throughput (at least on all modern Intel processors). Thus, the throughput of the two implementations is the same, since their loops are similar (same unrolling).


Additional notes

It is a shame that np.sum does not fully make use of SIMD instructions, as this would speed up the computations using it a lot (especially for small integers).

[UPDATE] Looking at the Numpy code, I discovered that the code is not vectorized because the array stride is a runtime value and the compiler does not generate a specialized version where the stride is 1. In fact, this can be partially seen in the previous assembly code: the compiler used the instruction add %r10,%rdi because %r10 (the stride of the target array) is not known at compile time. There is currently no optimization for this specific case of reduction in the Numpy code (the functions are relatively generic). This may change in the near future.

In addition to the stride problem, one big point makes it hard for compilers to automatically vectorize such code: floating-point addition is not associative (unless flags like -ffast-math are used), so the compiler is not allowed to reorder the accumulation on its own.
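A classic one-liner shows the non-associativity:

# Regrouping the additions changes the rounding, which is why a compiler cannot
# legally turn a scalar left-to-right accumulation into a reordered SIMD one.
print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))  # False
print((0.1 + 0.2) + 0.3, 0.1 + (0.2 + 0.3))    # 0.6000000000000001 0.6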

answered Oct 28 '22 by Jérôme Richard