Why does Python3 run faster if it is negating vs XOR?

Question

I thought negating a word or XOR a word with 1 both should take 1 clock cycle of the processor, but if using Python3, the following code to count the parity of a word:

def parity(x: int) -> int:
    p = 0
    while x:
        p = ~p
        x = x & (x-1)
    return p & 1

running against 10,000 test cases would report 4μs average running time consistently, but if it is:

def parity(x: int) -> int:
    p = 0
    while x:
        p = p ^ 1
        x = x & (x-1)
    return p

then it is consistently 5μs. It actually has one less operation at the end not needing to & 1.

Even

def parity(x: int) -> int:
    p = 0
    while x:
        p += 1
        x = x & (x-1)
    return p % 2

is consistently running for 4μs. To test increment vs addition, I changed the p += 1 to p += 7 and it is again consistently 4μs. What is the reason negating, increment, or addition faster than XOR using Python3? (it is on a Mac if it makes any difference).

Hymns For Disco · Accepted Answer

As it turns out, in Python, not all operators are created equally.

You mention "clock cycle"s in your question. It is true that a typical CPU can do many of these tasks in one cycle, however, execution of Python code (in CPython), is dictated by the compiled bytecode. This is a list of operations (op codes), which are mapped to C code in the CPython runtime.

The execution time of these portions of C code obviously may differ. In fact, you'll find that none of them associate with a "1 cycle" operation, but rather multiple function calls and branching statements.

If you disassemble the 2 functions you have given, dis.dis(parity), you will see how their op codes differ in the relevant section.

From the first function (negate)

10 LOAD_FAST                1 (p)
12 UNARY_INVERT
14 STORE_FAST               1 (p)

From the second function (XOR):

10 LOAD_FAST                1 (p)
12 LOAD_CONST               2 (1)
14 BINARY_XOR
16 STORE_FAST               1 (p)

Most notably, the UNARY_INVERT is changed to a BINARY_XOR. Check the master switch (opcode) { ... } in Python/ceval.c to see how these op codes and their corresponding code differ.

Here is a small test program to compare the time of some of the different operators in Python.

import timeit

print('binary operators')

ops = ['+', '-', '^', '&', '|']
for op in ops:
    t = timeit.timeit(f'a = a {op} b', 'a = 1; b = 2')
    print(op, t)

print('unary operators')

ops = ['-', '~', 'not']
for op in ops:
    t = timeit.timeit(f'a = {op} a', 'a = 1')
    print(op, t)

Here are my results in CPython 3.7.2 64 bit (Windows)

binary operators
+ 0.038782999999999956
- 0.03799190000000002
^ 0.05287609999999998
& 0.03716779999999997
| 0.0518267
unary operators
- 0.020763399999999987
~ 0.020213900000000007
not 0.01701140000000001

From these results, it seems ^ or BINARY_XOR is in fact the slowest operator (of those tested)

Why does Python3 run faster if it is negating vs XOR?

Tags:

performance

python

python-3.x

bit-manipulation

nonopolarity

1 Answers

Hymns For Disco

Recent Activity

Donate For Us

Why does Python3 run faster if it is negating vs XOR?

Tags:

performance

python

python-3.x

bit-manipulation

nonopolarity

1 Answers

Hymns For Disco

Related questions

Recent Activity

Donate For Us