Can simple calculations on variable length iterables be made faster in Python?

Question

I'm calculating the (squared) Euclidean distance between two vectors represented by tuples.

(u[0]-v[0])**2 + (u[1]-v[1])**2 + (u[3]-v[3])**2 ...

The hard-coded way of doing this is pretty fast. However, I would like to make no assumptions about the length of these vectors. That results in solutions like:

sum([(a-b)**2 for a, b in izip(u, v)]) # Faster without generator

or

sum = 0
for i in xrange(len(u)):
    sum += (u[i]-v[i])**2

Both turn out to be much slower (at least twice as slow) than the first version. Is there some smart way of optimizing this without resorting to NumPy/SciPy? I'm aware that those packages offer the fastest way of doing such things, but at the moment I'm mainly trying to gain experience with optimizing "bare Python". What I found to be fast is dynamically building a string that defines the function and exec()-ing it, but I would say that's really a last resort...

The requirements:

CPython 2.7
Standard library only
"Real" (e.g. no exec()), pure Python

Even though my question is about the matter of small operations in general, you may assume in your solution that one of the vectors remains the same over several function calls.

Marco Sulla · Accepted Answer

from itertools import izip

mysum = 0
for a, b in izip(u, v):
    mysum += (a-b)**2

About 35% faster than your third version (the xrange loop).

PS: have you tried Cython (not CPython) or Shed Skin?
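To check that figure on your own machine, a minimal timing sketch with the standard timeit module may help. The function names index_loop and izip_loop, the sample vectors, and the repeat count are illustrative assumptions. The try/except import is only there so the snippet also runs on Python 3, where izip and xrange no longer exist.

```python
import timeit

try:
    from itertools import izip  # Python 2
except ImportError:
    izip = zip                  # Python 3: zip is already lazy
    xrange = range              # Python 3: range is already lazy


def index_loop(u, v):
    # The question's third version: index into both tuples.
    total = 0
    for i in xrange(len(u)):
        total += (u[i] - v[i]) ** 2
    return total


def izip_loop(u, v):
    # The loop from this answer: iterate over pairs, no subscripting.
    total = 0
    for a, b in izip(u, v):
        total += (a - b) ** 2
    return total


# Hypothetical sample data; use vectors of the size you actually have.
u = (1.0, 2.0, 3.0, 4.0)
v = (4.0, 3.0, 2.0, 1.0)

if __name__ == "__main__":
    for func in (index_loop, izip_loop):
        seconds = timeit.timeit(lambda: func(u, v), number=100000)
        print("%s: %.3f s" % (func.__name__, seconds))
```

The lambda adds the same call overhead to both measurements, so the comparison between the two stays fair even though the absolute numbers are slightly inflated.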

marr75 · Answer

As I understand it, you don't so much need to make the code faster as want to know why it's slower. To answer that, let's look at the disassembly. For the purposes of this discussion, I'm going to wrap each method in a function; the loading of u and v and the return instruction can be ignored in each disassembly.

def test1(u, v):
    return (u[0]-v[0])**2 + (u[1]-v[1])**2 + (u[3]-v[3])**2

dis.dis(test1)
 0 LOAD_FAST                0 (u)
 3 LOAD_CONST               1 (0)
 6 BINARY_SUBSCR       
 7 LOAD_FAST                1 (v)
10 LOAD_CONST               1 (0)
13 BINARY_SUBSCR       
14 BINARY_SUBTRACT     
15 LOAD_CONST               2 (2)
18 BINARY_POWER        
19 LOAD_FAST                0 (u)
22 LOAD_CONST               3 (1)
25 BINARY_SUBSCR       
26 LOAD_FAST                1 (v)
29 LOAD_CONST               3 (1)
32 BINARY_SUBSCR       
33 BINARY_SUBTRACT     
34 LOAD_CONST               2 (2)
37 BINARY_POWER        
38 BINARY_ADD          
39 LOAD_FAST                0 (u)
42 LOAD_CONST               4 (3)
45 BINARY_SUBSCR       
46 LOAD_FAST                1 (v)
49 LOAD_CONST               4 (3)
52 BINARY_SUBSCR       
53 BINARY_SUBTRACT     
54 LOAD_CONST               2 (2)
57 BINARY_POWER        
58 BINARY_ADD          
59 RETURN_VALUE

I cut the first example off at a length of 3 because it would just repeat the same pattern over and over. You can quickly see that there is no function call overhead: the interpreter is doing pretty much the minimum possible work on these operands to achieve your result.

def test2(u, v):
    return sum((a-b)**2 for a, b in izip(u, v))

dis.dis(test2)
 0 LOAD_GLOBAL              0 (sum)
 3 LOAD_CONST               1 (<code object <genexpr> at 02C6F458, file "<pyshell#10>", line 2>)
 6 MAKE_FUNCTION            0
 9 LOAD_GLOBAL              1 (izip)
12 LOAD_FAST                0 (u)
15 LOAD_FAST                1 (v)
18 CALL_FUNCTION            2
21 GET_ITER            
22 CALL_FUNCTION            1
25 CALL_FUNCTION            1
28 RETURN_VALUE

What we see here is that we create a function out of the generator expression and load two globals, sum and izip. Global lookups are slower than local lookups; we can't avoid making them once, but if they're going to be called in a tight loop, many people assign them to a local such as _izip or _sum. We then suffer four expensive bytecode operations in a row: calling izip, getting an iterator from its result, calling the function created for the generator expression, and then calling sum (which consumes the iterator and adds up each item before returning).
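The local-alias idiom mentioned above can be sketched as follows. Binding the builtins as default arguments is one common way to do it. The function nearest_sq_dist and its underscore-prefixed parameters are hypothetical names chosen for illustration, and whether the saving is measurable is something to time rather than assume.

```python
try:
    from itertools import izip  # Python 2
except ImportError:
    izip = zip                  # Python 3: zip is already lazy


def nearest_sq_dist(points, v, _sum=sum, _izip=izip):
    # _sum and _izip are bound once, at definition time, so inside the loop
    # they are local-variable lookups rather than global/builtin lookups.
    best = None
    for u in points:
        d = _sum((a - b) ** 2 for a, b in _izip(u, v))
        if best is None or d < best:
            best = d
    return best


print(nearest_sq_dist([(0, 0), (3, 4)], (3, 3)))  # 18 and 1, so prints 1
```

The alias only pays off when the lookup happens many times per call, as in the loop over points here; for a single sum(...) per call the two global lookups are a negligible part of the cost.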

def test3(u, v):
    sum = 0
    for i in xrange(len(u)):
        sum += (u[i]-v[i])**2

dis.dis(test3)
 0 LOAD_CONST               1 (0)
 3 STORE_FAST               2 (sum)

 6 SETUP_LOOP              52 (to 61)
 9 LOAD_GLOBAL              0 (xrange)
12 LOAD_GLOBAL              1 (len)
15 LOAD_FAST                0 (u)
18 CALL_FUNCTION            1
21 CALL_FUNCTION            1
24 GET_ITER            
25 FOR_ITER                32 (to 60)
28 STORE_FAST               3 (i)

31 LOAD_FAST                2 (sum)
34 LOAD_FAST                0 (u)
37 LOAD_FAST                3 (i)
40 BINARY_SUBSCR       
41 LOAD_FAST                1 (v)
44 LOAD_FAST                3 (i)
47 BINARY_SUBSCR       
48 BINARY_SUBTRACT     
49 LOAD_CONST               2 (2)
52 BINARY_POWER        
53 INPLACE_ADD         
54 STORE_FAST               2 (sum)
57 JUMP_ABSOLUTE           25
60 POP_BLOCK           
61 LOAD_CONST               0 (None)
64 RETURN_VALUE

What we see here is a more straightforward version of what is happening in test2. There is no generator expression and no call to sum, but we've traded that function call overhead for the calls to len and xrange and for two subscript operations per iteration, instead of using the faster solution suggested by @Lucas Malor in the accepted answer.

def test4(u, v):
    mysum = 0
    for a, b in izip(u, v):
        mysum += (a-b)**2
    return mysum

dis.dis(test4)
 0 LOAD_CONST               1 (0)
 3 STORE_FAST               2 (mysum)

 6 SETUP_LOOP              47 (to 56)
 9 LOAD_GLOBAL              0 (izip)
12 LOAD_FAST                0 (u)
15 LOAD_FAST                1 (v)
18 CALL_FUNCTION            2
21 GET_ITER            
22 FOR_ITER                30 (to 55)
25 UNPACK_SEQUENCE          2
28 STORE_FAST               3 (a)
31 STORE_FAST               4 (b)

34 LOAD_FAST                2 (mysum)
37 LOAD_FAST                3 (a)
40 LOAD_FAST                4 (b)
43 BINARY_SUBTRACT     
44 LOAD_CONST               2 (2)
47 BINARY_POWER        
48 INPLACE_ADD         
49 STORE_FAST               2 (mysum)
52 JUMP_ABSOLUTE           22
55 POP_BLOCK           

56 LOAD_FAST                2 (mysum)
59 RETURN_VALUE

The above represents @Lucas Malor's contribution, and it's faster in a few ways. It replaces the subscript operations with unpacking while reducing the number of calls to one. In many cases, this is as fast as you're going to get within the constraints you've given us.

Note that it would only be worth evaluating a run-time generated string similar to the function in test1 if you were going to call the function enough times to repay the overhead of generating it. Note also that as u and v grow longer (which is typically how algorithms of this type are evaluated), the function call overhead of the other solutions becomes a smaller and smaller share of the total, so in most cases the most readable solution is vastly superior.

At the same time, even though it's slower in small cases, if your sequences u and v may be very long, I recommend a generator expression over a list comprehension. The memory savings will make execution much faster in most cases (and make gc faster).
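For reference, the run-time generated variant mentioned above could look roughly like this. It falls outside the question's "no exec()" requirement and is shown only to make the trade-off concrete. make_unrolled is a hypothetical helper name, and the generated source follows the shape of test1 but with consecutive indices.

```python
def make_unrolled(n):
    """Build a function computing the squared distance for length-n vectors."""
    # One "(u[i]-v[i])**2" term per index, joined with "+"; "0" covers n == 0.
    terms = " + ".join(["(u[%d]-v[%d])**2" % (i, i) for i in range(n)]) or "0"
    source = "def sq_dist(u, v):\n    return %s\n" % terms
    namespace = {}
    # Compiling the string is the one-off cost that repeated calls must repay.
    exec(source, namespace)
    return namespace["sq_dist"]


# Build once per vector length, then reuse the result.
sq_dist3 = make_unrolled(3)
print(sq_dist3((1, 2, 3), (4, 6, 3)))  # 9 + 16 + 0 = 25
```

Caching the generated function per length (for example in a dict keyed by n) would keep the compile cost from being paid more than once, at the price of more code to maintain.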

Overall, my recommendation is that the tiny speedup for short sequences is just not worth the micro-optimizations you're looking at: they increase code size and behave inconsistently across other implementations of Python. The "best" solution is almost certainly test2.

Tags:

python

python-2.7

Thijs van Dien