Is it possible to speed up this loop in Python?

Tags: python, numpy

The usual ways to map a function over a numpy.ndarray, such as np.array(list(map(some_func, x))) or np.vectorize(f)(x), don't give the function access to the index of each element. The following code is a simple example of a pattern that comes up in many applications.

dis_mat = np.zeros([feature_mat.shape[0], feature_mat.shape[0]])

for i in range(feature_mat.shape[0]):
    for j in range(i, feature_mat.shape[0]):
        dis_mat[i, j] = np.linalg.norm(
            feature_mat[i, :] - feature_mat[j, :]
        )
        dis_mat[j, i] = dis_mat[i, j]

Is there a way to speed it up?

Thank you for your help! The quickest way to speed up this code is to use the function that @user2357112 mentioned in the comments:

    from scipy.spatial.distance import pdist, squareform
    dis_mat = squareform(pdist(feature_mat))

@Julien's method also works well when feature_mat is small, but for a 1000 by 2000 feature_mat it needs nearly 40 GB of memory.
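As a rough check on that memory figure, the size of the broadcast difference array can be estimated without allocating it. This is a back-of-the-envelope sketch that assumes float64 data; `broadcast_diff_bytes` is a name made up for illustration, and it counts only the (n, n, d) difference array, not any temporaries that np.linalg.norm may allocate on top of it.

```python
import numpy as np

def broadcast_diff_bytes(n_rows, n_cols, dtype=np.float64):
    """Bytes needed for the (n_rows, n_rows, n_cols) difference array."""
    return n_rows * n_rows * n_cols * np.dtype(dtype).itemsize

size = broadcast_diff_bytes(1000, 2000)
print(f"{size / 1e9:.1f} GB")  # 16.0 GB for the difference array alone
```

Any further temporaries of the same shape created while computing the norm would add to this total.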

asked Nov 30 '17 04:11

Stephen Wang

2 Answers

SciPy comes with a function specifically to compute the kind of pairwise distances you're computing. It's scipy.spatial.distance.pdist, and it produces the distances in a condensed format that basically only stores the upper triangle of the distance matrix, but you can convert the result to square form with scipy.spatial.distance.squareform:

from scipy.spatial.distance import pdist, squareform

distance_matrix = squareform(pdist(feature_mat))
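To make the condensed format concrete, here is a small sketch on made-up points (the array `points` is illustrative and not from the question). For n rows, pdist returns n*(n-1)/2 distances, and squareform expands them into a symmetric n-by-n matrix with zeros on the diagonal.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Three illustrative 2-D points on a line through the origin.
points = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])

# Condensed form: one entry per unordered pair, in the order (0,1), (0,2), (1,2).
condensed = pdist(points)  # distances 5, 10, 5

# Square form: full symmetric matrix, zeros on the diagonal.
square = squareform(condensed)

print(condensed.shape, square.shape)
```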

The pdist approach avoids the giant intermediate arrays that a direct vectorized solution requires, so it's faster and works on larger inputs. It is slower, though, than an approach that uses algebraic manipulation to let dot do the heavy lifting.
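The dot-based approach is not spelled out here. One common version, sketched below as an assumption about what is meant, expands ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b so that a single matrix product does most of the work. The function name `pairwise_euclidean_dot` is hypothetical. Rounding can make tiny squared distances slightly negative, hence the clipping step, and the result can be less accurate than pdist for rows that are nearly identical.

```python
import numpy as np

def pairwise_euclidean_dot(feature_mat):
    """Pairwise Euclidean distances via the Gram matrix (illustrative sketch)."""
    # Squared norm of every row, shape (n,).
    sq_norms = np.einsum("ij,ij->i", feature_mat, feature_mat)
    # All pairwise dot products in one matrix multiplication, shape (n, n).
    gram = feature_mat @ feature_mat.T
    # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * gram
    # Clip small negative values caused by rounding, and zero the diagonal.
    np.maximum(sq_dists, 0.0, out=sq_dists)
    np.fill_diagonal(sq_dists, 0.0)
    return np.sqrt(sq_dists)

feature_mat = np.random.default_rng(0).random((100, 200))
dis_mat = pairwise_euclidean_dot(feature_mat)
print(dis_mat.shape)
```

The only large arrays here are n by n, so memory use stays far below that of the broadcast difference.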

pdist also supports a wide variety of alternate distance metrics, if you decide you want something other than Euclidean distance.

# Manhattan distance!
distance_matrix = squareform(pdist(feature_mat, 'cityblock'))

# Cosine distance!
distance_matrix = squareform(pdist(feature_mat, 'cosine'))

# Correlation distance!
distance_matrix = squareform(pdist(feature_mat, 'correlation'))

# And more! Check out the docs.

answered Oct 27 '22 04:10

user2357112 supports Monica

You can create a new axis and broadcast:

dis_mat = np.linalg.norm(feature_mat[:,None] - feature_mat, axis=-1)
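To see why this works, the shapes can be traced on a small made-up array (the 4-by-3 size is arbitrary). Indexing with None inserts a length-1 axis, the subtraction broadcasts over every pair of rows, and taking the norm over the last axis collapses each difference vector to a scalar.

```python
import numpy as np

# Small illustrative input: 4 rows, 3 features.
feature_mat = np.arange(12, dtype=float).reshape(4, 3)

expanded = feature_mat[:, None]           # shape (4, 1, 3)
diffs = expanded - feature_mat            # shape (4, 4, 3); diffs[i, j] = row i - row j
dis_mat = np.linalg.norm(diffs, axis=-1)  # shape (4, 4)

print(expanded.shape, diffs.shape, dis_mat.shape)
```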

Timing:

import numpy as np

feature_mat = np.random.rand(100, 200)

def a():
    dis_mat = np.zeros([feature_mat.shape[0], feature_mat.shape[0]])
    for i in range(feature_mat.shape[0]):
        for j in range(i, feature_mat.shape[0]):
            dis_mat[i, j] = np.linalg.norm(
                feature_mat[i, :] - feature_mat[j, :]
            )
            dis_mat[j, i] = dis_mat[i, j]

def b():
    dis_mat = np.linalg.norm(feature_mat[:,None] - feature_mat, axis=-1)



%timeit a()
100 loops, best of 3: 20.5 ms per loop

%timeit b()
100 loops, best of 3: 11.8 ms per loop

answered Oct 27 '22 04:10

Julien
