I have to do a large number of operations (additions) on relatively small integers, and I started considering which datatype would give the best performance on a 64 bit machine.
I was convinced that adding together 4 uint16 values would take the same time as one uint64 addition, since the ALU could make 4 uint16 additions using only 1 uint64 adder. (Carry propagation means this doesn't work that easily for a single 64-bit adder, but this is how integer SIMD instructions work.) Apparently this is not the case:
In [3]: data = np.random.rand(10000)
In [4]: int16 = data.astype(np.uint16)
In [5]: int64 = data.astype(np.uint64)
In [6]: int32 = data.astype(np.uint32)
In [7]: float32 = data.astype(np.float32)
In [8]: float64 = data.astype(np.float64)
In [9]: %timeit int16.sum()
13.4 µs ± 43.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [10]: %timeit int32.sum()
13.9 µs ± 347 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [11]: %timeit int64.sum()
9.33 µs ± 47.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [12]: %timeit float32.sum()
5.79 µs ± 6.51 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [13]: %timeit float64.sum()
6 µs ± 3.54 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
All int operations take the same time (more than the float ops) with no speedup. Is this due to the C implementation of numpy not being fully optimized, or is there some hardware limitation that I am not aware of?
TL;DR: I made an experimental analysis on Numpy 1.21.1. Experimental results show that np.sum does NOT (really) make use of SIMD instructions: no SIMD instructions are used for integers, and only scalar SIMD instructions are used for floating-point numbers! Moreover, Numpy converts smaller integer types to 64-bit values by default so as to avoid overflows!

Note that this may not reflect all Numpy versions, since there is ongoing work to provide SIMD support for commonly used functions (the not-yet-released Numpy 1.22.0rc1 continues this long-standing work). Moreover, the compiler or the processor used may significantly impact the results. The following experiments have been done using a Numpy retrieved from pip on a Debian Linux machine with an i5-9600KF processor.
np.sum
For floating-point numbers, Numpy uses a pairwise algorithm which is known to be quite numerically stable while being relatively fast. This can be seen in the code, but also simply by using a profiler: TYPE_pairwise_sum is the C function called to compute the sum at runtime (where TYPE is DOUBLE or FLOAT).
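To make the idea of pairwise summation concrete, here is a rough Python sketch of the recursion (an illustration only, not Numpy's actual C implementation, which uses a larger block size and unrolled inner loops):

def pairwise_sum(x, block=8):
    # Small blocks are accumulated directly; larger arrays are split in two
    # halves summed recursively, which bounds the growth of rounding errors.
    n = len(x)
    if n <= block:
        return sum(x)
    mid = n // 2
    return pairwise_sum(x[:mid], block) + pairwise_sum(x[mid:], block)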
For integers, Numpy uses a classical naive reduction. The C function called is ULONG_add_avx2 on AVX2-compatible machines. Somewhat surprisingly, it also converts items to 64-bit ones if the type is not np.int64.
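This default cast can also be observed from Python: the dtype of the result of a plain sum on a small integer array is the 64-bit type (the exact default depends on the platform; the values below assume a typical 64-bit Linux build):

import numpy as np

# The default accumulator for small integer types is the 64-bit integer type,
# which is why the cast to 64-bit happens before the reduction.
print(np.zeros(3, dtype=np.uint16).sum().dtype)  # uint64 on a 64-bit Linux build
print(np.zeros(3, dtype=np.int16).sum().dtype)   # int64 on a 64-bit Linux build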
Here is the hot part of the assembly code executed by the DOUBLE_pairwise_sum function:
3,65 │ a0:┌─→add $0x8,%rcx ; Loop iterator
3,60 │ │ prefetcht0 (%r8,%rax,1) ; Prefetch data
9,46 │ │ vaddsd (%rax,%rbp,1),%xmm1,%xmm1 ; Accumulate an item in xmm1[0]
4,65 │ │ vaddsd (%rax),%xmm0,%xmm0 ; Same for xmm0[0]
6,91 │ │ vaddsd (%rax,%rdx,1),%xmm4,%xmm4 ; Etc.
7,77 │ │ vaddsd (%rax,%rdx,2),%xmm7,%xmm7
7,41 │ │ vaddsd (%rax,%r10,1),%xmm2,%xmm2
7,27 │ │ vaddsd (%rax,%rdx,4),%xmm6,%xmm6
6,80 │ │ vaddsd (%rax,%r11,1),%xmm3,%xmm3
7,46 │ │ vaddsd (%rax,%rbx,1),%xmm5,%xmm5
3,46 │ │ add %r12,%rax ; Switch to the next items (x8)
0,13 │ ├──cmp %rcx,%r9 ; Should the loop continue?
3,27 │ └──jg a0 ; Jump to the beginning if so
The compiled Numpy code clearly uses the scalar SIMD instruction vaddsd (computing only a single double-precision item at a time), although it does successfully unroll the loop 8 times. The same code is generated for FLOAT_pairwise_sum: vaddss is called 8 times.
For np.uint32, here is the hot part of the generated assembly code:
2,37 │160:┌─→add $0x1,%rax ; Loop iterator
95,95 │ │ add (%rdi),%rdx ; Accumulate the values in %rdx
0,06 │ │ add %r10,%rdi ; Switch to the next item
│ ├──cmp %rsi,%rax ; Should the loop continue?
1,08 │ └──jne 160 ; Jump to the beginning if so
Numpy obviously does not use SIMD instructions for the np.uint32 type. It does not even unroll the loop. The add (%rdi),%rdx instruction performing the accumulation is the bottleneck here due to the data dependency on the accumulator. The same loop can be seen for np.uint64 (despite the function being named ULONG_add_avx2).
However, the np.uint32 version calls the C function _aligned_contig_cast_uint_to_ulong in order to convert the integer items to a wider type. Numpy does that to avoid integer overflows. The same thing can be seen for the np.uint8 and np.uint16 types (although the name of the function differs). Fortunately, this cast function makes use of SIMD instructions (SSE), but it still takes a significant portion of the execution time (~30% of the np.sum time).
EDIT: as pointed out by @user2357112supportsMonica, the dtype parameter of np.sum can be explicitly specified. When it matches the dtype of the input array, the conversion is not performed. This results in a smaller execution time at the expense of a possible overflow.
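For example, a benchmark along these lines (an illustrative snippet, not necessarily the exact code used for the timings below) passes the matching dtype explicitly:

import numpy as np

data = np.random.rand(10000)
uint16 = data.astype(np.uint16)

# Default behavior: the uint16 items are first converted to uint64
# to avoid overflows during the reduction.
s_default = uint16.sum()

# Explicit matching dtype: the conversion is skipped, but the 16-bit
# accumulator can easily overflow on large arrays.
s_nocast = uint16.sum(dtype=np.uint16)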
Here are the results on my machine:
uint16: 7.17 µs ± 80 ns per loop (mean ± std. dev. of 7 runs, 20000 loops each)
uint32: 7.11 µs ± 12.3 ns per loop (mean ± std. dev. of 7 runs, 20000 loops each)
uint64: 5.05 µs ± 8.57 ns per loop (mean ± std. dev. of 7 runs, 20000 loops each)
float32: 2.88 µs ± 9.27 ns per loop (mean ± std. dev. of 7 runs, 20000 loops each)
float64: 3.06 µs ± 10.6 ns per loop (mean ± std. dev. of 7 runs, 20000 loops each)
First of all, note that the results are very similar to the ones provided in the question, meaning that the behavior seen on my machine can be reproduced on others. Thus, the explanation should also be consistent.

As you can see, the 64-bit version is faster than the other integer-based versions. This is due to the overhead of the conversion. The first two are equally fast because of the scalar loop and the add instruction being equally fast for 8-bit, 16-bit and 32-bit integers (this should be true for most mainstream 64-bit platforms). The integer implementations are slower than the floating-point ones because of the lack of (proper) loop unrolling.
The floating-point implementations are equally fast thanks to the scalar SIMD instructions. Indeed, the instructions vaddss (for np.float32) and vaddsd (for np.float64) have the same latency and throughput (at least on all modern Intel processors). Thus, the throughput of the two implementations is the same, since their loops are similar (same unrolling).
It is a pity that np.sum does not fully make use of SIMD instructions, as this would speed up a lot of computations using it (especially with small integers).
[UPDATE] Looking at the Numpy code, I discovered that the code is not vectorized because the array stride is a runtime value and the compiler does not generate a specialized version where the stride is 1. In fact, this can be partially seen in the previous assembly code: the compiler used the instruction add %r10, %rdi because %r10 (the stride of the target array) is not known at compile time. There is currently no optimization for this specific case of reduction in the Numpy code (the functions are relatively generic). This may change in the near future.
In addition to the stride problem, one big point makes it hard for compilers to automatically vectorize such code: floating-point addition is not associative (unless flags like -ffast-math are used), so the compiler cannot freely reorder the accumulation.
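A quick illustration of this non-associativity in plain Python (the same reordering issue applies to the accumulations a vectorizing compiler would have to perform):

# Reordering a floating-point sum can change the result:
a = (0.1 + 0.2) + 0.3  # 0.6000000000000001
b = 0.1 + (0.2 + 0.3)  # 0.6
print(a == b)          # False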