A few years ago, someone posted on Active State Recipes for comparison purposes, three python/NumPy functions; each of these accepted the same arguments and returned the same result, a distance matrix.
Two of these were taken from published sources; they are both--or they appear to me to be--idiomatic numpy code. The repetitive calculations required to create a distance matrix are driven by numpy's elegant index syntax. Here's one of them:
from numpy.matlib import repmat, repeat
def calcDistanceMatrixFastEuclidean(points):
numPoints = len(points)
distMat = sqrt(sum((repmat(points, numPoints, 1) -
repeat(points, numPoints, axis=0))**2, axis=1))
return distMat.reshape((numPoints,numPoints))
The third created the distance matrix using a single loop (which, obviously is a lot of looping given that a distance matrix of just 1,000 2D points, has one million entries). At first glance this function looked to me like the code I used to write when I was learning NumPy and I would write NumPy code by first writing Python code and then translating it, line by line.
Several months after the Active State post, results of performance tests comparing the three were posted and discussed in a thread on the NumPy mailing list.
The function with the loop in fact significantly outperformed the other two:
from numpy import mat, zeros, newaxis
def calcDistanceMatrixFastEuclidean2(nDimPoints):
nDimPoints = array(nDimPoints)
n,m = nDimPoints.shape
delta = zeros((n,n),'d')
for d in xrange(m):
data = nDimPoints[:,d]
delta += (data - data[:,newaxis])**2
return sqrt(delta)
One participant in the thread (Keir Mierle) offered a reason why this might be true:
The reason that I suspect this will be faster is that it has better locality, completely finishing a computation on a relatively small working set before moving onto the next one. The one liners have to pull the potentially large MxN array into the processor repeatedly.
By this poster's own account, his remark is only a suspicion, and it doesn't appear that it was discussed any further.
Any other thoughts about how to account for these results?
In particular, is there a useful rule--regarding when to loop and when to index--that can be extracted from this example as guidance in writing numpy code?
For those not familiar with NumPy, or who haven't looked at the code, this comparison is not based on an edge case--it certainly wouldn't be that interesting to me if it were. Instead, this comparison involves a function that performs a common task in matrix computation (i.e., creating a result array given two antecedents); moreover, each function is in turn comprised of among the most common numpy built-ins.
For index loop is working almost 2 times faster compare to for each loop and for iterator loop.
The index value represents the number of the currently executing iteration. More specifically, the body of the loop (the instructions that are executed during each loop repetition) can be said to be iterated as the loop executes.
In a for loop, the index of iteration is always a clearly defined variable. By common practice, the variable is usually the letter i. This makes it easy to index one or more arrays by the index. For loops can easily be used to iterate through elements of multidimensional arrays using nested for loops.
TL; DR The second code above is only looping over the number of dimensions of the points (3 times through the for loop for 3D points) so the looping isn't much there. The real speed-up in the second code above is that it better harnesses the power of Numpy to avoid creating some extra matrices when finding the differences between points. This reduces memory used and computational effort.
Longer Explanation
I think that the calcDistanceMatrixFastEuclidean2
function is deceiving you with its loop perhaps. It is only looping over the number of dimensions of the points. For 1D points, the loop only executes once, for 2D, twice, and for 3D, thrice. This is really not much looping at all.
Let's analyze the code a little bit to see why the one is faster than the other. calcDistanceMatrixFastEuclidean
I will call fast1
and calcDistanceMatrixFastEuclidean2
will be fast2
.
fast1
is based on the Matlab way of doing things as is evidenced by the repmap
function. The repmap
function creates an array in this case that is just the original data repeated over and over again. However, if you look at the code for the function, it is very inefficient. It uses many Numpy functions (3 reshape
s and 2 repeat
s) to do this. The repeat
function is also used to create an array that contains the the original data with each data item repeated many times. If our input data is [1,2,3]
then we are subtracting [1,2,3,1,2,3,1,2,3]
from [1,1,1,2,2,2,3,3,3]
. Numpy has had to create a lot of extra matrices in between running Numpy's C code which could have been avoided.
fast2
uses more of Numpy's heavy lifting without creating as many matrices between Numpy calls. fast2
loops through each dimension of the points, does the subtraction and keeps a running total of the squared differences between each dimension. Only at the end is the square root done. So far, this may not sound quite as efficient as fast1
, but fast2
avoids doing the repmat
stuff by using Numpy's indexing. Let's look at the 1D case for simplicity. fast2
makes a 1D array of the data and subtracts it from a 2D (N x 1) array of the data. This creates the difference matrix between each point and all of the other points without having to use repmat
and repeat
and thereby bypasses creating a lot of extra arrays. This is where the real speed difference lies in my opinion. fast1
creates a lot of extra in between matrices (and they are created expensively computationally) to find the differences between points while fast2
better harnesses the power of Numpy to avoid these.
By the way, here is a little bit faster version of fast2
:
def calcDistanceMatrixFastEuclidean3(nDimPoints):
nDimPoints = array(nDimPoints)
n,m = nDimPoints.shape
data = nDimPoints[:,0]
delta = (data - data[:,newaxis])**2
for d in xrange(1,m):
data = nDimPoints[:,d]
delta += (data - data[:,newaxis])**2
return sqrt(delta)
The difference is that we are no longer creating delta as a zeros matrix.
dis
for fun:
dis.dis(calcDistanceMatrixFastEuclidean)
2 0 LOAD_GLOBAL 0 (len)
3 LOAD_FAST 0 (points)
6 CALL_FUNCTION 1
9 STORE_FAST 1 (numPoints)
3 12 LOAD_GLOBAL 1 (sqrt)
15 LOAD_GLOBAL 2 (sum)
18 LOAD_GLOBAL 3 (repmat)
21 LOAD_FAST 0 (points)
24 LOAD_FAST 1 (numPoints)
27 LOAD_CONST 1 (1)
30 CALL_FUNCTION 3
4 33 LOAD_GLOBAL 4 (repeat)
36 LOAD_FAST 0 (points)
39 LOAD_FAST 1 (numPoints)
42 LOAD_CONST 2 ('axis')
45 LOAD_CONST 3 (0)
48 CALL_FUNCTION 258
51 BINARY_SUBTRACT
52 LOAD_CONST 4 (2)
55 BINARY_POWER
56 LOAD_CONST 2 ('axis')
59 LOAD_CONST 1 (1)
62 CALL_FUNCTION 257
65 CALL_FUNCTION 1
68 STORE_FAST 2 (distMat)
5 71 LOAD_FAST 2 (distMat)
74 LOAD_ATTR 5 (reshape)
77 LOAD_FAST 1 (numPoints)
80 LOAD_FAST 1 (numPoints)
83 BUILD_TUPLE 2
86 CALL_FUNCTION 1
89 RETURN_VALUE
dis.dis(calcDistanceMatrixFastEuclidean2)
2 0 LOAD_GLOBAL 0 (array)
3 LOAD_FAST 0 (nDimPoints)
6 CALL_FUNCTION 1
9 STORE_FAST 0 (nDimPoints)
3 12 LOAD_FAST 0 (nDimPoints)
15 LOAD_ATTR 1 (shape)
18 UNPACK_SEQUENCE 2
21 STORE_FAST 1 (n)
24 STORE_FAST 2 (m)
4 27 LOAD_GLOBAL 2 (zeros)
30 LOAD_FAST 1 (n)
33 LOAD_FAST 1 (n)
36 BUILD_TUPLE 2
39 LOAD_CONST 1 ('d')
42 CALL_FUNCTION 2
45 STORE_FAST 3 (delta)
5 48 SETUP_LOOP 76 (to 127)
51 LOAD_GLOBAL 3 (xrange)
54 LOAD_FAST 2 (m)
57 CALL_FUNCTION 1
60 GET_ITER
>> 61 FOR_ITER 62 (to 126)
64 STORE_FAST 4 (d)
6 67 LOAD_FAST 0 (nDimPoints)
70 LOAD_CONST 0 (None)
73 LOAD_CONST 0 (None)
76 BUILD_SLICE 2
79 LOAD_FAST 4 (d)
82 BUILD_TUPLE 2
85 BINARY_SUBSCR
86 STORE_FAST 5 (data)
7 89 LOAD_FAST 3 (delta)
92 LOAD_FAST 5 (data)
95 LOAD_FAST 5 (data)
98 LOAD_CONST 0 (None)
101 LOAD_CONST 0 (None)
104 BUILD_SLICE 2
107 LOAD_GLOBAL 4 (newaxis)
110 BUILD_TUPLE 2
113 BINARY_SUBSCR
114 BINARY_SUBTRACT
115 LOAD_CONST 2 (2)
118 BINARY_POWER
119 INPLACE_ADD
120 STORE_FAST 3 (delta)
123 JUMP_ABSOLUTE 61
>> 126 POP_BLOCK
8 >> 127 LOAD_GLOBAL 5 (sqrt)
130 LOAD_FAST 3 (delta)
133 CALL_FUNCTION 1
136 RETURN_VALUE
I'm not an expert on dis
, but it seems like you would have to look more at the functions that the first is calling to know why they take a while. There is a performance profiler tool with Python as well, cProfile
.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With