EDIT2: As @ShadowRanger pointed out, this is a Numpy phenomenon, not Python. But when doing the calculations in Python with list comprehensions (so x+y becomes [a+b for a, b in zip(x, y)]), all arithmetic operations still take equally long (although more than 100x as long as Numpy). However, when I use integer division in my real simulations, they run way faster.
So the main question remains: even in Python, why do these tests suggest integer division isn't faster than regular division?
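A minimal pure-Python benchmark to reproduce the list-comprehension timings (the list size and repeat count here are arbitrary choices, not the exact values from my tests):

```python
from timeit import timeit

# Pure-Python version of the Numpy benchmark: element-wise ops via zip.
# y starts at 1 so the division is well-defined for every pair.
setup = "x = list(range(1, 10**5 + 1)); y = list(range(1, 10**5 + 1))"

t_true = timeit("[a / b for a, b in zip(x, y)]", setup=setup, number=100)
t_floor = timeit("[a // b for a, b in zip(x, y)]", setup=setup, number=100)
print(f"/  : {t_true:.3f}s for 100 runs")
print(f"// : {t_floor:.3f}s for 100 runs")
```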
EDIT1: versions: Python 3.5.5, Numpy 1.15.0.
It seems that in Python/Numpy, integer division is more expensive than regular division (of integers), which is counter-intuitive. When testing, I get this:
setup_string = 'import numpy as np;\
N=int(1e5);\
x=np.arange(1,N+1, dtype=int);\
y=np.arange(N, dtype=int);'
addition (+) ~ 0.1s
timeit("x+y", setup=setup_string, number=int(1e3))
0.09872294100932777
subtraction (-) ~ 0.1s
timeit("x-y", setup=setup_string, number=int(1e3))
0.09425603999989107
multiplication (*) ~ 0.1s
timeit("x*y", setup=setup_string, number=int(1e3))
0.09888673899695277
division (/) ~ 0.35s
timeit("x/y", setup=setup_string, number=int(1e3))
0.3574664070038125
integer division (//) ~ 1s (!)
timeit("x//y", setup=setup_string, number=int(1e3))
1.006298642983893
Any ideas why this is happening? Why is integer division not faster?
Division is much slower than multiplication. But some smart compilers / VMs transform division into multiplication, so your tests will have the same results (both tests end up testing multiplication).
Thus, programmers should be aware that multiplication is faster than division, and should favor multiplication when they know the rounding error from using the inverse is acceptable.
And the results (see the comments) are similar to those for Intel: division is about 3-6 times slower than multiplication.
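This trade-off is easy to see with Numpy (a small sketch; the array size and divisor are made up, and note that x * (1.0 / 3.0) can differ from x / 3.0 in the last bit for some inputs):

```python
import numpy as np
from timeit import timeit

setup = "import numpy as np; x = np.arange(1, 10**5 + 1, dtype=np.float64)"

# Divide by 3 versus multiply by the precomputed reciprocal of 3.
t_div = timeit("x / 3.0", setup=setup, number=1000)
t_mul = timeit("x * (1.0 / 3.0)", setup=setup, number=1000)
print(f"divide:   {t_div:.3f}s")
print(f"multiply: {t_mul:.3f}s")

# The results agree to within floating point rounding.
x = np.arange(1, 10**5 + 1, dtype=np.float64)
assert np.allclose(x / 3.0, x * (1.0 / 3.0))
```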
Short answer: Floating point division is cheaper at the hardware level than integer division. And on at least one common architecture, floating point division can be vectorized while integer division cannot, so the most expensive operation in your code is performed more times, at a higher cost per operation, for integer math; floating point math performs it fewer times, at a lower cost per operation.
Long answer: numpy uses vectorized math when available, and the x86-64 architecture (which I'm guessing you're using) doesn't provide a SIMD instruction for integer division. It only provides vectorized multiplication for integers (via the PMULUDQ family of instructions), but provides both multiplication (MULPD family) and division (DIVPD family) for floating point.
When you use / for true divide, the result type is float64, not int64, and numpy can perform the operation with a single packed load and convert (via the VCVTQQ2PD family of instructions), followed by a packed divide, followed by a packed move back to memory (MOVAPD family).
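The result-type difference is easy to confirm (a tiny sketch with made-up arrays):

```python
import numpy as np

x = np.arange(1, 6, dtype=np.int64)
y = np.arange(1, 6, dtype=np.int64)

print((x / y).dtype)   # float64: true divide always promotes to floating point
print((x // y).dtype)  # int64: floor divide keeps the integer type
```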
On the most modern x86-64 chips with AVX512 (Xeon Phi x200+ and Skylake-X and higher, the latter being available to the desktop market since late 2017), each such vectorized instruction can perform eight operations at once (older architectures from post-2011 can do four with AVX, and before that you could do two with SSE2). For /, that means you only need to issue two VCVTQQ2PDs (one for each source array), one VDIVPD, and one VMOVAPD (all EVEX prefixed for 512 bit operation) for every eight divisions to be performed. By contrast, for // to perform the same eight divisions, it needs to issue eight MOVs from memory (to load the left array operands), eight CQOs (to sign extend the left array operands to the 128 bit values IDIV needs), eight IDIVs (which at least loads from the right side array for you), and eight MOVs back to memory.
I don't know if numpy takes full advantage of this (my own copy is clearly compiled for the SSE2 baseline all x86-64 machines provide, so it only does two divisions at once, not eight), but it's possible, whereas there is no way to vectorize the equivalent integer operations at all.
While the individual instructions for the integer case are often a little cheaper, they're basically always more expensive than the combined equivalent instructions. And for integer division, a single operation is actually worse than the packed floating point division: per Agner Fog's Skylake-X table, the cost per IDIV is 24-90 cycles, with a latency of 42-95; the cost of VDIVPD with all 512 bit registers is 16 cycles, with a latency of 24 cycles. VDIVPD isn't just doing eight times the work, it's doing it in (at most) two thirds the cycles required by IDIV (I don't know why IDIV has such a great range for cycle count, but VDIVPD beats even the best number for IDIV). For plain AVX operations (only four divides per VDIVPD), the cycles per op is cut in half (to eight), while plain DIVPD at two divides per instruction is only four cycles, so the division itself is basically the same speed regardless of whether you use SSE2, AVX or AVX512 instructions (AVX512 just saves you a little on latency and load/store overhead). Even if vectorized instructions were never used, plain FDIV is only a 4-5 cycle instruction (binary floating point division happens to be easier than integer division in general, go figure), so you'd expect to see floating point math do pretty well.
Point is, at a hardware level, dividing a lot of 64 bit floating point values is cheaper than dividing a lot of 64 bit integer values, so true division using / is inherently faster than floor division using //.
On my own machine (which I've verified uses only baseline SSE2 DIVPD, so it only does two divisions per instruction), I attempted to replicate your results, and my timings were a little less divergent. True division took 485 μs per op, while floor division took 1.05 ms per op; floor division was only slightly more than 2x slower, where it was almost 3x slower for you. At a guess, your copy of numpy was compiled with support for AVX or AVX512, and you're squeezing a bit more performance out of true division as a result.
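To replicate this comparison on your own machine, a per-op variant of the original benchmark can be used (a sketch; I start y at 1 rather than 0 to avoid a divide-by-zero warning, and the repeat count is arbitrary):

```python
from timeit import timeit

setup = ("import numpy as np;"
         "N = int(1e5);"
         "x = np.arange(1, N + 1, dtype=np.int64);"
         "y = np.arange(1, N + 1, dtype=np.int64)")

reps = 200
per_op_true = timeit("x / y", setup=setup, number=reps) / reps
per_op_floor = timeit("x // y", setup=setup, number=reps) / reps
print(f"/  : {per_op_true * 1e6:.0f} us per op")
print(f"// : {per_op_floor * 1e6:.0f} us per op")
print(f"ratio // vs / : {per_op_floor / per_op_true:.2f}")
```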
As for why non-numpy Python int floor division takes longer than true division, it's a similar reason, but there are a few complicating factors:

- (Helps int / int, hurts int // int) The same hardware instruction overhead issues from numpy apply; IDIV is slower than FDIV.
- (Hurts int // int) Python ints are more complex than floats; on CPython, int is implemented as an array of 15 or 30 bit limbs, with a ssize_t that defines both the signedness and the number of limbs. CPython's float is just a plain object wrapper around a normal C double, with no special features.
- (Hurts int // int) Python guarantees floor division, but C only provides truncating integer division, so even in the fast path for small ints, Python has to check for mismatched signs and adjust the operation to ensure the result is the floor, not simple truncation.
- (Hurts the int / int operation specifically for large inputs) CPython's int / int operation doesn't just cast both operands to float (C double) and perform floating point division. It tries to do so when the operands are small enough, but if they're too large, it has a complicated fallback algorithm to achieve best possible correctness.
- (Helps int / int if the result is discarded in short order) Since Python floats are fixed size, CPython implements a minor performance optimization: a free list for them. When you create and delete floats repeatedly, the newly created ones don't need to go to the memory allocator; they just pull from the free list, and release to the free list when the reference count drops to zero. CPython's ints are variable length, and a free list isn't used.

On balance, this is all slightly in favor of int / int (at least for small ints; the large int case gets more complex, but then, it's probably worse for int // int too, since array-based division algorithms are insanely complicated/expensive), so seeing similar behaviors with Python built-in types is not unexpected.
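Two of those points can be checked directly from the interpreter: the floor-division guarantee, and the large-input fallback for int / int (the specific numbers below are just illustrative):

```python
# Python guarantees floor division; C's IDIV truncates toward zero.
assert -7 // 2 == -4        # floored quotient
assert int(-7 / 2) == -3    # truncation, what C's integer division would give
assert -7 % 2 == 1          # Python's remainder takes the divisor's sign

# int / int stays correct even when the operands overflow a C double,
# so CPython cannot simply cast both sides to float here.
big = 10 ** 400             # float(big) would raise OverflowError
assert big / (big // 100) == 100.0
```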