EDIT2: As @ShadowRanger pointed out, this is a Numpy phenomenon, not Python. But when doing the calculations in Python with list comprehensions (so x+y becomes [a+b for a, b in zip(x, y)]), all arithmetic operations still take equally long (although more than 100x as long as Numpy). However, when I use integer division in my real simulations, they run way faster.
So the main question remains: even in Python, why do these tests suggest integer division isn't faster than regular division?
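A minimal pure-Python benchmark to reproduce the list-comprehension timings (the list size and repeat count here are arbitrary choices, not the exact values from my tests):

```python
from timeit import timeit

# Pure-Python version of the Numpy benchmark: element-wise ops via zip.
# y starts at 1 so the division is well-defined for every pair.
setup = "x = list(range(1, 10**5 + 1)); y = list(range(1, 10**5 + 1))"

t_true = timeit("[a / b for a, b in zip(x, y)]", setup=setup, number=100)
t_floor = timeit("[a // b for a, b in zip(x, y)]", setup=setup, number=100)
print(f"/  : {t_true:.3f}s for 100 runs")
print(f"// : {t_floor:.3f}s for 100 runs")
```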
EDIT1: versions: Python 3.5.5, Numpy 1.15.0.
It seems that in Python/Numpy, integer division is more expensive than regular division (of integers), which is counter-intuitive. When testing, I get this:
setup_string = 'import numpy as np;\
N=int(1e5);\
x=np.arange(1,N+1, dtype=int);\
y=np.arange(N, dtype=int);'
addition (+) ~ 0.1s
timeit("x+y", setup=setup_string, number=int(1e3))
0.09872294100932777
subtraction (-) ~ 0.1s
timeit("x-y", setup=setup_string, number=int(1e3))
0.09425603999989107
multiplication (*) ~ 0.1s
timeit("x*y", setup=setup_string, number=int(1e3))
0.09888673899695277
division (/) ~ 0.35s
timeit("x/y", setup=setup_string, number=int(1e3))
0.3574664070038125
integer division (//) ~ 1s (!)
timeit("x//y", setup=setup_string, number=int(1e3))
1.006298642983893
Any ideas why this is happening? Why is integer division not faster?
Division is much slower than multiplication. But some smart compilers / VMs transform division into multiplication, so your tests will have the same results (both tests end up testing multiplication).
Thus, programmers should be aware that multiplication is faster than division, and should favor multiplication when they know the rounding error from using the inverse is acceptable.
And the results (see the comments) are similar to those for Intel: division is about 3-6 times slower than multiplication.
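This trade-off is easy to see with Numpy (a small sketch; the array size and divisor are made up, and note that x * (1.0 / 3.0) can differ from x / 3.0 in the last bit for some inputs):

```python
import numpy as np
from timeit import timeit

setup = "import numpy as np; x = np.arange(1, 10**5 + 1, dtype=np.float64)"

# Divide by 3 versus multiply by the precomputed reciprocal of 3.
t_div = timeit("x / 3.0", setup=setup, number=1000)
t_mul = timeit("x * (1.0 / 3.0)", setup=setup, number=1000)
print(f"divide:   {t_div:.3f}s")
print(f"multiply: {t_mul:.3f}s")

# The results agree to within floating point rounding.
x = np.arange(1, 10**5 + 1, dtype=np.float64)
assert np.allclose(x / 3.0, x * (1.0 / 3.0))
```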
Short answer: Floating point division is cheaper at the hardware level than integer division. And on at least one common architecture, floating point division can be vectorized while integer division cannot, so the most expensive operation in your code is performed more times, at a higher cost per operation, for integer math; floating point math performs it fewer times, at a lower cost per operation.
Long answer: numpy uses vectorized math when available, and the x86-64 architecture (which I'm guessing you're using) doesn't provide a SIMD instruction for integer division. It only provides vectorized multiplication for integers (via the PMULUDQ family of instructions), but provides both multiplication (MULPD family) and division (DIVPD family) for floating point.
When you use / for true divide, the result type is float64, not int64, and numpy can perform the operation with a single packed load and convert (via the VCVTQQ2PD family of instructions), followed by a packed divide, followed by a packed move back to memory (MOVAPD family).
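The result-type difference is easy to confirm (a tiny sketch with made-up arrays):

```python
import numpy as np

x = np.arange(1, 6, dtype=np.int64)
y = np.arange(1, 6, dtype=np.int64)

print((x / y).dtype)   # float64: true divide always promotes to floating point
print((x // y).dtype)  # int64: floor divide keeps the integer type
```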
On the most modern x86-64 chips with AVX512 (Xeon Phi x200+ and Skylake-X and higher, the latter being available to the desktop market since late 2017), each such vectorized instruction can perform eight operations at once (older architectures from post-2011 can do four with AVX, and before that you could do two with SSE2). For /, that means you only need to issue two VCVTQQ2PDs (one for each source array), one VDIVPD, and one VMOVAPD (all EVEX prefixed for 512 bit operation) for every eight divisions to be performed. By contrast, for // to perform the same eight divisions, it needs to issue eight MOVs from memory (to load the left array operands), eight CQOs (to sign extend the left array operands to the 128 bit values IDIV needs), eight IDIVs (which at least loads from the right side array for you), and eight MOVs back to memory.
I don't know if numpy takes full advantage of this (my own copy is clearly compiled for the SSE2 baseline all x86-64 machines provide, so it only does two divisions at once, not eight), but it's possible, whereas there is no way to vectorize the equivalent integer operations at all.
While the individual instructions for the integer case are often a little cheaper, they're basically always more expensive than the combined equivalent instructions. And for integer division, a single operation is actually worse than the packed floating point division: per Agner Fog's Skylake-X table, the cost per IDIV is 24-90 cycles, with a latency of 42-95; the cost of VDIVPD with all 512 bit registers is 16 cycles, with a latency of 24 cycles. VDIVPD isn't just doing eight times the work, it's doing it in (at most) two thirds the cycles required by IDIV (I don't know why IDIV has such a great range for cycle count, but VDIVPD beats even the best number for IDIV). For plain AVX operations (only four divides per VDIVPD), the cycles per op is cut in half (to eight), while plain DIVPD at two divides per instruction is only four cycles, so the division itself is basically the same speed regardless of whether you use SSE2, AVX or AVX512 instructions (AVX512 just saves you a little on latency and load/store overhead). Even if vectorized instructions were never used, plain FDIV is only a 4-5 cycle instruction (binary floating point division happens to be easier than integer division in general, go figure), so you'd expect to see floating point math do pretty well.
Point is, at a hardware level, dividing a lot of 64 bit floating point values is cheaper than dividing a lot of 64 bit integer values, so true division using / is inherently faster than floor division using //.
On my own machine (which I've verified uses only baseline SSE2 DIVPD, so it only does two divisions per instruction), I attempted to replicate your results, and my timings were a little less divergent. True division took 485 μs per op, while floor division took 1.05 ms per op; floor division was only slightly more than 2x slower, where it was almost 3x slower for you. At a guess, your copy of numpy was compiled with support for AVX or AVX512, and you're squeezing a bit more performance out of true division as a result.
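To replicate this comparison on your own machine, a per-op variant of the original benchmark can be used (a sketch; I start y at 1 rather than 0 to avoid a divide-by-zero warning, and the repeat count is arbitrary):

```python
from timeit import timeit

setup = ("import numpy as np;"
         "N = int(1e5);"
         "x = np.arange(1, N + 1, dtype=np.int64);"
         "y = np.arange(1, N + 1, dtype=np.int64)")

reps = 200
per_op_true = timeit("x / y", setup=setup, number=reps) / reps
per_op_floor = timeit("x // y", setup=setup, number=reps) / reps
print(f"/  : {per_op_true * 1e6:.0f} us per op")
print(f"// : {per_op_floor * 1e6:.0f} us per op")
print(f"ratio // vs / : {per_op_floor / per_op_true:.2f}")
```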
As for why non-numpy Python int floor division takes longer than true division, it's a similar reason, but there are a few complicating factors:

- (Helps int / int, hurts int // int) The same hardware instruction overhead issues from numpy apply; IDIV is slower than FDIV.
- (Hurts int // int) Python ints are more complex than floats; on CPython, int is implemented as an array of 15 or 30 bit limbs, with a ssize_t that defines both the signedness and the number of limbs. CPython's float is just a plain object wrapper around a normal C double, with no special features.
- (Hurts int // int) Python guarantees floor division, but C only provides truncating integer division, so even in the fast path for small ints, Python has to check for mismatched signs and adjust the operation to ensure the result is the floor, not simple truncation.
- (Hurts the int / int operation specifically for large inputs) CPython's int / int operation doesn't just cast both operands to float (C double) and perform floating point division. It tries to do so when the operands are small enough, but if they're too large, it has a complicated fallback algorithm to achieve best possible correctness.
- (Helps int / int if the result is discarded in short order) Since Python floats are fixed size, CPython implements a minor performance optimization: a free list for them. When you create and delete floats repeatedly, the newly created ones don't need to go to the memory allocator; they just pull from the free list, and release to the free list when the reference count drops to zero. CPython's ints are variable length, and a free list isn't used.

On balance, this is all slightly in favor of int / int (at least for small ints; the large int case gets more complex, but then, it's probably worse for int // int too, since array-based division algorithms are insanely complicated/expensive), so seeing similar behaviors with Python built-in types is not unexpected.
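Two of those points can be checked directly from the interpreter: the floor-division guarantee, and the large-input fallback for int / int (the specific numbers below are just illustrative):

```python
# Python guarantees floor division; C's IDIV truncates toward zero.
assert -7 // 2 == -4        # floored quotient
assert int(-7 / 2) == -3    # truncation, what C's integer division would give
assert -7 % 2 == 1          # Python's remainder takes the divisor's sign

# int / int stays correct even when the operands overflow a C double,
# so CPython cannot simply cast both sides to float here.
big = 10 ** 400             # float(big) would raise OverflowError
assert big / (big // 100) == 100.0
```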