 

Why are np.hypot and np.subtract.outer so fast compared to vanilla broadcasting? Using Numba to speed up NumPy in parallel for distance matrix calculation

I have two large sets of 2D points and I need to calculate a distance matrix.

I needed it to be fast and in Python, so naturally I used NumPy. I recently learned about NumPy broadcasting and used it so that, rather than looping in Python, NumPy does the work in C.

I really thought broadcasting was all I needed, until I saw other methods working better than vanilla broadcasting. I have two ways of calculating the distance matrix and I don't understand why one is better than the other.

I looked at https://github.com/numpy/numpy/issues/14761 and my results seem contradictory.

Below are the two ways I calculate the distance matrix.

Cells 3, 4, 5 and cells 8, 9 both calculate the distance matrix, but cells 3+4, which use subtract.outer, are way faster than cell 8, which uses vanilla broadcasting, and cell 5, which uses hypot, is way faster than cell 9, which does it the simple way. I did not try a Python loop, assuming it would never finish.

I want to know

1. Is there a faster way to calculate the distance matrix (maybe scikit-learn or scipy)?

2. Why are hypot and subtract.outer so fast?

I have also attached a snippet to run the whole thing for convenience, and I change the seed each run to prevent cache reuse.

### Cell 1
import numpy as np

np.random.seed(858442)

### Cell 2
%%time
obs = np.random.random((50000, 2))
interp = np.random.random((30000, 2))

CPU times: user 2.02 ms, sys: 1.4 ms, total: 3.42 ms
Wall time: 1.84 ms

### Cell 3
%%time
d0 = np.subtract.outer(obs[:,0], interp[:,0])

CPU times: user 2.46 s, sys: 1.97 s, total: 4.42 s
Wall time: 4.42 s

### Cell 4
%%time
d1 = np.subtract.outer(obs[:,1], interp[:,1])

CPU times: user 3.1 s, sys: 2.7 s, total: 5.8 s
Wall time: 8.34 s

### Cell 5
%%time
h = np.hypot(d0, d1)

CPU times: user 12.7 s, sys: 24.6 s, total: 37.3 s
Wall time: 1min 6s

### Cell 6
np.random.seed(773228)

### Cell 7
%%time
obs = np.random.random((50000, 2))
interp = np.random.random((30000, 2))

CPU times: user 1.84 ms, sys: 1.56 ms, total: 3.4 ms
Wall time: 2.03 ms

### Cell 8
%%time
d = obs[:, np.newaxis, :] - interp
d0, d1 = d[:, :, 0], d[:, :, 1]

CPU times: user 22.7 s, sys: 8.24 s, total: 30.9 s
Wall time: 33.2 s

### Cell 9
%%time
h = np.sqrt(d0**2 + d1**2)

CPU times: user 29.1 s, sys: 2min 12s, total: 2min 41s
Wall time: 6min 10s

Update, thanks to Jérôme Richard's answer below:

  • Stack Overflow never disappoints
  • There is a faster way, using Numba
  • Numba has a just-in-time compiler which converts a Python snippet to fast machine code. The first call of a function is a little slower than subsequent calls because it compiles, but even on the first call njit with parallel=True beats hypot + subtract.outer by a 9x margin for a (49000, 12000) matrix

Performance of various methods

  • make sure to use a different seed each time you run the script
import sys
import time

import numba as nb
import numpy as np

np.random.seed(int(sys.argv[1]))

d0 = np.random.random((49000, 2))
d1 = np.random.random((12000, 2))

@nb.njit((nb.float64[:, :], nb.float64[:, :]))
def f1(d0, d1):
    print('Numba without parallel')
    res = np.empty((d0.shape[0], d1.shape[0]), dtype=d0.dtype)
    for i in nb.prange(d0.shape[0]):
        for j in range(d1.shape[0]):
            res[i, j] = np.sqrt((d0[i, 0] - d1[j, 0])**2 + (d0[i, 1] - d1[j, 1])**2)
    return res

# Eager compilation: providing the signature makes Numba compile beforehand
@nb.njit((nb.float64[:, :], nb.float64[:, :]), parallel=True)
def f2(d0, d1):
    print('Numba with parallel')
    res = np.empty((d0.shape[0], d1.shape[0]), dtype=d0.dtype)
    for i in nb.prange(d0.shape[0]):
        for j in range(d1.shape[0]):
            res[i, j] = np.sqrt((d0[i, 0] - d1[j, 0])**2 + (d0[i, 1] - d1[j, 1])**2)
    return res

def f3(d0, d1):
    print('hypot + subtract.outer')
    return np.hypot(
        np.subtract.outer(d0[:,0], d1[:,0]),
        np.subtract.outer(d0[:,1], d1[:,1])
    )

if __name__ == '__main__':
    s1 = time.time()
    eval(f'{sys.argv[2]}(d0, d1)')
    print(time.time() - s1)
(base) xx@xx:~/xx$ python3 test.py 523432 f3
hypot + subtract.outer
9.79756784439087
(base) xx@xx:~/xx$ python3 test.py 213622 f2
Numba with parallel
0.3393140316009521

I will update this post with further developments and if I find an even faster method.

asked Jul 30 '21 by eroot163pi


1 Answer

First of all, d0 and d1 each take 50000 x 30000 x 8 = 12 GB, which is pretty big. Make sure you have more than 100 GB of memory, because this is what the whole script requires! This is a huge amount of memory. If you do not have enough memory, the operating system will use a storage device (eg. swap) to store excess data, which is much slower. Actually, there is no reason Cell 4 should be slower than Cell 3, and I guess you already do not have enough memory to (fully) store d1 in RAM while d0 seems to fit (mostly) in memory. There is no difference on my machine when both fit in RAM (one can also reverse the order of the operations to check this). This also explains why further operations tend to get slower.
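
As a rough sanity check, assuming float64 arrays (8 bytes per element), the footprint can be estimated directly:

n_obs, n_interp = 50000, 30000
bytes_per_matrix = n_obs * n_interp * 8   # one full (50000, 30000) float64 matrix
print(bytes_per_matrix / 1e9)             # 12.0, i.e. ~12 GB per matrix

d0, d1, the hypot result and the temporary arrays of the broadcast variant each cost one more such matrix, which is how the whole script ends up needing more than 100 GB and starts swapping.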

That being said, Cells 8+9 are also slower because they create temporary arrays and need more passes over memory to compute the result than Cells 3+4+5. Indeed, the expression np.sqrt(d0**2 + d1**2) first computes d0**2, resulting in a new 12 GB temporary array, then computes d1**2, resulting in another 12 GB temporary array, then sums the two temporary arrays to produce yet another 12 GB temporary array, and finally computes the square root, resulting in another 12 GB temporary array. This can require up to 48 GB of memory and needs 4 read-write memory-bound passes. This is not efficient and does not use the CPU/RAM well (eg. CPU caches).
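
For reference, most of these temporaries can be avoided in pure NumPy by reusing the existing buffers with in-place operations (a sketch only: it overwrites d0 and d1, and it still needs several passes over 12 GB arrays):

np.square(d0, out=d0)   # d0 <- d0**2, computed in place, no new allocation
np.square(d1, out=d1)   # d1 <- d1**2, computed in place
d0 += d1                # d0 <- d0**2 + d1**2, still in place
np.sqrt(d0, out=d0)     # d0 now holds the distance matrix
h = d0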

There is a much faster implementation, which does the whole computation in one pass and in parallel, using Numba's JIT. Here is an example:

import numpy as np
import numba as nb

@nb.njit(parallel=True)
def distanceMatrix(a, b):
    res = np.empty((a.shape[0], b.shape[0]), dtype=a.dtype)
    for i in nb.prange(a.shape[0]):
        for j in range(b.shape[0]):
            res[i, j] = np.sqrt((a[i, 0] - b[j, 0])**2 + (a[i, 1] - b[j, 1])**2)
    return res
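
A minimal usage sketch with the arrays from the question (note that the first call also pays the JIT compilation cost, so timings are usually taken on a later call):

obs = np.random.random((50000, 2))
interp = np.random.random((30000, 2))

h = distanceMatrix(obs, interp)   # first call compiles, later calls reuse the machine code
print(h.shape)                    # (50000, 30000)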

This implementation uses 3 times less memory (only 12 GB) and is much faster than the one using subtract.outer. Indeed, due to swapping, Cells 3+4+5 take a few minutes while this one takes 1.3 seconds!

The takeaway is that memory accesses are expensive, and so are temporary arrays. One needs to avoid making multiple passes over memory while working on huge buffers, and to take advantage of CPU caches when the computation is not trivial (for example by using array chunks).
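
For example, a chunked pure-NumPy variant (just a sketch; the block size of 1024 is arbitrary) computes the matrix block by block so that every temporary stays small enough to be cache- and RAM-friendly:

def distance_matrix_chunked(a, b, chunk=1024):
    # Process `a` in blocks of `chunk` rows so each temporary is only (chunk, len(b)).
    res = np.empty((a.shape[0], b.shape[0]), dtype=a.dtype)
    for start in range(0, a.shape[0], chunk):
        d = a[start:start + chunk, None, :] - b[None, :, :]
        res[start:start + chunk] = np.sqrt(d[..., 0]**2 + d[..., 1]**2)
    return res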

answered Oct 30 '22 by Jérôme Richard