The normal way to map a function in a numpy.narray
like np.array[map(some_func,x)]
or vectorize(f)(x)
can't provide an index.
The following code is just a simple example that is commonly seen in many applications.
dis_mat = np.zeros([feature_mat.shape[0], feature_mat.shape[0]])
for i in range(feature_mat.shape[0]):
for j in range(i, feature_mat.shape[0]):
dis_mat[i, j] = np.linalg.norm(
feature_mat[i, :] - feature_mat[j, :]
)
dis_mat[j, i] = dis_mat[i, j]
Is there a way to speed it up?
Thank you for your help! The quickest way to speed up this code is this, using the function that @user2357112 commented about:
from scipy.spatial.distance import pdist,squareform
dis_mat = squareform(pdist(feature_mat))
@Julien's method is also good if feature_mat
is small, but when the feature_mat
is 1000 by 2000, then it needs nearly 40 GB of memory.
List comprehensions are faster than for loops to create lists. But, this is because we are creating a list by appending new elements to it at each iteration.
This opens up possibility of many optimizations which are not possible in Python. For this reason, we see loops in python are often much slower than in C, and nested loops is where things can really get slow.
SciPy comes with a function specifically to compute the kind of pairwise distances you're computing. It's scipy.spatial.distance.pdist
, and it produces the distances in a condensed format that basically only stores the upper triangle of the distance matrix, but you can convert the result to square form with scipy.spatial.distance.squareform
:
from scipy.spatial.distance import pdist, squareform
distance_matrix = squareform(pdist(feature_mat))
This has the benefit of avoiding the giant intermediate arrays required with a direct vectorized solution, so it's faster and works on larger inputs. It loses the timing to an approach that uses algebraic manipulations to have dot
handle the heavy lifting, though.
pdist
also supports a wide variety of alternate distance metrics, if you decide you want something other than Euclidean distance.
# Manhattan distance!
distance_matrix = squareform(pdist(feature_mat, 'cityblock'))
# Cosine distance!
distance_matrix = squareform(pdist(feature_mat, 'cosine'))
# Correlation distance!
distance_matrix = squareform(pdist(feature_mat, 'correlation'))
# And more! Check out the docs.
You can create a new axis and broadcast:
dis_mat = np.linalg.norm(feature_mat[:,None] - feature_mat, axis=-1)
Timing:
feature_mat = np.random.rand(100,200)
def a():
dis_mat = np.zeros([feature_mat.shape[0], feature_mat.shape[0]])
for i in range(feature_mat.shape[0]):
for j in range(i, feature_mat.shape[0]):
dis_mat[i, j] = np.linalg.norm(
feature_mat[i, :] - feature_mat[j, :]
)
dis_mat[j, i] = dis_mat[i, j]
def b():
dis_mat = np.linalg.norm(feature_mat[:,None] - feature_mat, axis=-1)
%timeit a()
100 loops, best of 3: 20.5 ms per loop
%timeit b()
100 loops, best of 3: 11.8 ms per loop
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With