I work on hierarchical agglomerative clustering of large sets of multidimensional vectors, and I noticed that the biggest bottleneck is the construction of the distance matrix. A naive implementation of this task is the following (here in Python):
import numpy as np
from scipy.spatial.distance import cosine

def create_dist_matrix(v):
    """v is an array of shape (N, d), where rows are the observations
    and columns the dimensions."""
    N = v.shape[0]
    D = np.zeros((N, N))
    for i in range(N):
        for j in range(i + 1):
            D[i, j] = cosine(v[i, :], v[j, :])  # scipy.spatial.distance.cosine
    return D
I was wondering what the best way is to add some parallelism to this routine. An easy way would be to split the outer loop across a number of jobs, e.g. if you have 10 processors, create 10 different jobs for different ranges of i
and then concatenate the results. However, this "horizontal" solution doesn't seem quite right. Are there any other parallel algorithms (or existing libraries) for this task? Any help would be highly appreciated.
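To make the idea concrete, here is a rough sketch of the split I have in mind (the helper names and the chunking are just illustrative; note the load is also uneven, since larger i values do more work):

from multiprocessing import Pool

import numpy as np
from scipy.spatial.distance import cosine

def _rows(args):
    # compute the lower-triangular part of D for one chunk of row indices
    v, i_range = args
    block = np.zeros((len(i_range), v.shape[0]))
    for row, i in enumerate(i_range):
        for j in range(i + 1):
            block[row, j] = cosine(v[i, :], v[j, :])
    return i_range, block

def create_dist_matrix_parallel(v, n_jobs=10):
    N = v.shape[0]
    D = np.zeros((N, N))
    chunks = np.array_split(np.arange(N), n_jobs)  # one contiguous range of i per job
    # the Pool call needs an if __name__ == "__main__" guard on platforms that spawn processes
    with Pool(n_jobs) as pool:
        for i_range, block in pool.imap_unordered(_rows, [(v, c) for c in chunks]):
            D[i_range, :] = block
    return D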
Looks like scikit-learn
has a parallel version of pdist called pairwise_distances
from sklearn.metrics.pairwise import pairwise_distances

D = pairwise_distances(X=v, metric='cosine', n_jobs=-1)

where n_jobs=-1 specifies that all CPUs will be used.
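If the end goal is hierarchical clustering, the resulting square matrix still has to be condensed before scipy.cluster.hierarchy.linkage will accept it. A minimal sketch continuing from the D computed above (checks=False is there only because cosine distances can leave tiny floating-point noise on the diagonal):

from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage

# squareform expects a symmetric matrix with an exactly zero diagonal;
# checks=False tolerates the tiny numerical noise cosine distances can leave there.
condensed = squareform(D, checks=False)
Z = linkage(condensed, method='average')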
See @agartland's answer: you can specify n_jobs in sklearn.metrics.pairwise.pairwise_distances, or look for a clustering algorithm in sklearn.cluster that exposes an n_jobs parameter, e.g. sklearn.cluster.KMeans.
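For instance, a minimal sketch with KMeans (note this assumes an older scikit-learn release where KMeans still accepts n_jobs; newer versions dropped the parameter and parallelize internally instead):

import numpy as np
from sklearn.cluster import KMeans

data = np.random.rand(100, 10)  # placeholder: n_samples x m_features

# n_jobs=-1 uses all CPUs in older scikit-learn releases
km = KMeans(n_clusters=10, n_jobs=-1)
labels = km.fit_predict(data)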
Still, if you're feeling adventurous, you can implement the computation yourself. For example, if you need the condensed 1D distance array for scipy.cluster.hierarchy.linkage
you can use:
#!/usr/bin/env python3
from multiprocessing import Pool

import numpy as np
from time import time as ts

data = np.zeros((100, 10))  # YOUR data: np.array[n_samples x m_features]
n_processes = 4             # YOUR number of processors

def metric(a, b):           # YOUR dist function
    return np.sum(np.abs(a - b))

n = data.shape[0]
k_max = n * (n - 1) // 2    # maximum elements in 1D dist array
k_step = n ** 2 // 500      # ~500 bulks
dist = np.zeros(k_max)      # resulting 1D dist array

def proc(start):
    dist = []
    k1 = start
    k2 = min(start + k_step, k_max)
    for k in range(k1, k2):
        # get (i, j) for 2D distance matrix knowing (k) for 1D distance matrix
        i = int(n - 2 - int(np.sqrt(-8 * k + 4 * n * (n - 1) - 7) / 2.0 - 0.5))
        j = int(k + i + 1 - n * (n - 1) / 2 + (n - i) * ((n - i) - 1) / 2)
        # store distance
        a = data[i, :]
        b = data[j, :]
        d = metric(a, b)
        dist.append(d)
    return k1, k2, dist

ts_start = ts()
with Pool(n_processes) as pool:
    for k1, k2, res in pool.imap_unordered(proc, range(0, k_max, k_step)):
        dist[k1:k2] = res
        print("{:.0f} minutes, {:,}..{:,} out of {:,}".format(
            (ts() - ts_start) / 60, k1, k2, k_max))
print("Elapsed %.0f minutes" % ((ts() - ts_start) / 60))
print("Saving...")
np.savez("dist.npz", dist=dist)
print("DONE")
Just so you know, the scipy.cluster.hierarchy.linkage
implementation isn't parallel and its complexity is at least O(N^2). I'm not sure whether scipy
has a parallel implementation of this function.