Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Is it possible to speed up this loop in Python?

Tags:

python

numpy

The normal way to map a function in a numpy.narray like np.array[map(some_func,x)] or vectorize(f)(x) can't provide an index. The following code is just a simple example that is commonly seen in many applications.

dis_mat = np.zeros([feature_mat.shape[0], feature_mat.shape[0]])

for i in range(feature_mat.shape[0]):
    for j in range(i, feature_mat.shape[0]):
        dis_mat[i, j] = np.linalg.norm(
            feature_mat[i, :] - feature_mat[j, :]
        )
        dis_mat[j, i] = dis_mat[i, j]

Is there a way to speed it up?


Thank you for your help! The quickest way to speed up this code is this, using the function that @user2357112 commented about:

    from scipy.spatial.distance import pdist,squareform
    dis_mat = squareform(pdist(feature_mat))

@Julien's method is also good if feature_mat is small, but when the feature_mat is 1000 by 2000, then it needs nearly 40 GB of memory.

like image 608
Stephen Wang Avatar asked Nov 30 '17 04:11

Stephen Wang


People also ask

What is faster than for loop in Python?

List comprehensions are faster than for loops to create lists. But, this is because we are creating a list by appending new elements to it at each iteration.

Are loops slow in Python?

This opens up possibility of many optimizations which are not possible in Python. For this reason, we see loops in python are often much slower than in C, and nested loops is where things can really get slow.


2 Answers

SciPy comes with a function specifically to compute the kind of pairwise distances you're computing. It's scipy.spatial.distance.pdist, and it produces the distances in a condensed format that basically only stores the upper triangle of the distance matrix, but you can convert the result to square form with scipy.spatial.distance.squareform:

from scipy.spatial.distance import pdist, squareform

distance_matrix = squareform(pdist(feature_mat))

This has the benefit of avoiding the giant intermediate arrays required with a direct vectorized solution, so it's faster and works on larger inputs. It loses the timing to an approach that uses algebraic manipulations to have dot handle the heavy lifting, though.

pdist also supports a wide variety of alternate distance metrics, if you decide you want something other than Euclidean distance.

# Manhattan distance!
distance_matrix = squareform(pdist(feature_mat, 'cityblock'))

# Cosine distance!
distance_matrix = squareform(pdist(feature_mat, 'cosine'))

# Correlation distance!
distance_matrix = squareform(pdist(feature_mat, 'correlation'))

# And more! Check out the docs.
like image 190
user2357112 supports Monica Avatar answered Oct 27 '22 04:10

user2357112 supports Monica


You can create a new axis and broadcast:

dis_mat = np.linalg.norm(feature_mat[:,None] - feature_mat, axis=-1)

Timing:

feature_mat = np.random.rand(100,200)

def a():
    dis_mat = np.zeros([feature_mat.shape[0], feature_mat.shape[0]])
    for i in range(feature_mat.shape[0]):
        for j in range(i, feature_mat.shape[0]):
            dis_mat[i, j] = np.linalg.norm(
                feature_mat[i, :] - feature_mat[j, :]
            )
            dis_mat[j, i] = dis_mat[i, j]

def b():
    dis_mat = np.linalg.norm(feature_mat[:,None] - feature_mat, axis=-1)



%timeit a()
100 loops, best of 3: 20.5 ms per loop

%timeit b()
100 loops, best of 3: 11.8 ms per loop
like image 36
Julien Avatar answered Oct 27 '22 04:10

Julien