Parallel construction of a distance matrix

I work on hierarchical agglomerative clustering of large numbers of multidimensional vectors, and I noticed that the biggest bottleneck is the construction of the distance matrix. A naive implementation for this task is the following (here in Python):

import numpy as np
from scipy.spatial.distance import cosine

def create_dist_matrix(v):
    """v is an (N, d) array: rows are the observations, columns the dimensions."""
    N = v.shape[0]
    D = np.zeros((N, N))
    for i in range(N):
        for j in range(i + 1):              # lower triangle only; the matrix is symmetric
            D[i, j] = cosine(v[i, :], v[j, :])
    return D
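
For reference, the same matrix can also be built without the explicit Python loops using scipy.spatial.distance.pdist and squareform; this is still single-threaded, but much faster than the pure-Python double loop (a small sketch, the function name is just illustrative):

from scipy.spatial.distance import pdist, squareform

def create_dist_matrix_vectorized(v):
    # pdist returns the condensed 1-D form; squareform expands it to the full
    # symmetric N x N matrix (zero diagonal), rather than just the lower triangle.
    return squareform(pdist(v, metric='cosine'))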

I was wondering what the best way is to add some parallelism to this routine. An easy approach would be to split the outer loop across a number of jobs, e.g. with 10 processors create 10 jobs for different ranges of i and then concatenate the results (a rough sketch of this is shown below). However, this "horizontal" solution doesn't seem quite right. Are there any other parallel algorithms (or existing libraries) for this task? Any help would be highly appreciated.
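
A rough sketch of that row-splitting idea, assuming multiprocessing and a hypothetical helper compute_rows (each worker handles a strided subset of i values so the work is roughly balanced):

import numpy as np
from multiprocessing import Pool
from scipy.spatial.distance import cosine

def compute_rows(args):                  # hypothetical helper: one chunk of the outer loop
    v, rows = args
    return [(i, [cosine(v[i, :], v[j, :]) for j in range(i + 1)]) for i in rows]

def create_dist_matrix_parallel(v, n_jobs=4):
    N = v.shape[0]
    D = np.zeros((N, N))
    # stride the rows so each worker gets a mix of short and long rows;
    # note that v is pickled once per chunk here (shared memory would avoid the copies)
    chunks = [(v, list(range(start, N, n_jobs))) for start in range(n_jobs)]
    with Pool(n_jobs) as pool:
        for chunk in pool.map(compute_rows, chunks):
            for i, row in chunk:
                D[i, :i + 1] = row
    return D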

asked Jun 28 '12 by dkar

2 Answers

Looks like scikit-learn has a parallel version of pdist, called pairwise_distances:

from sklearn.metrics.pairwise import pairwise_distances

D = pairwise_distances(X = v, metric = 'cosine', n_jobs = -1)

where n_jobs = -1 specifies that all CPUs will be used.
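
If you then need the condensed 1-D form that scipy.cluster.hierarchy.linkage expects, scipy.spatial.distance.squareform can convert the square matrix (a small sketch; checks=False just skips the symmetry/zero-diagonal validation, which can trip on floating-point noise):

from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage

condensed = squareform(D, checks=False)    # (N, N) square matrix -> condensed 1-D vector
Z = linkage(condensed, method='average')   # hierarchical agglomerative clustering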

answered by agartland


See @agartland's answer: you can specify n_jobs in sklearn.metrics.pairwise.pairwise_distances, or look for a clustering algorithm in sklearn.cluster that takes an n_jobs parameter, e.g. sklearn.cluster.KMeans.
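
As a rough sketch of how that might look for the hierarchical case, the parallel distance matrix can be fed to sklearn.cluster.AgglomerativeClustering with a precomputed metric (the parameter is named metric in recent scikit-learn and affinity in older versions; n_clusters=10 is just illustrative):

from sklearn.metrics.pairwise import pairwise_distances
from sklearn.cluster import AgglomerativeClustering

D = pairwise_distances(X=v, metric='cosine', n_jobs=-1)    # distance matrix, computed in parallel
model = AgglomerativeClustering(n_clusters=10,
                                metric='precomputed',      # 'affinity' in older scikit-learn
                                linkage='average')
labels = model.fit_predict(D)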

Still, if you're feeling adventurous, you can implement the computation yourself. For example, if you need the condensed 1-D distance array for scipy.cluster.hierarchy.linkage, you can use:

#!/usr/bin/env python3
from multiprocessing import Pool
import numpy as np
from time import time as ts


data = np.zeros((100,10)) # YOUR data: np.array[n_samples x m_features]
n_processes = 4           # YOUR number of processors
def metric(a, b):         # YOUR dist function
    return np.sum(np.abs(a-b)) 


n = data.shape[0]
k_max = n * (n - 1) // 2  # maximum elements in 1D dist array
k_step = max(1, n ** 2 // 500)  # split the work into ~500 chunks
dist = np.zeros(k_max)    # resulting 1D dist array


def proc(start):
    dist = []
    k1 = start
    k2 = min(start + k_step, k_max)
    for k in range(k1, k2):
        # get (i, j) for 2D distance matrix knowing (k) for 1D distance matrix
        i = int(n - 2 - int(np.sqrt(-8 * k + 4 * n * (n - 1) - 7) / 2.0 - 0.5))
        j = int(k + i + 1 - n * (n - 1) / 2 + (n - i) * ((n - i) - 1) / 2)
        # store distance
        a = data[i, :]
        b = data[j, :]
        d = metric(a, b)
        dist.append(d)
    return k1, k2, dist


if __name__ == "__main__":   # guard is required where multiprocessing uses the "spawn" start method
    ts_start = ts()
    with Pool(n_processes) as pool:
        for k1, k2, res in pool.imap_unordered(proc, range(0, k_max, k_step)):
            dist[k1:k2] = res
            print("{:.0f} minutes, {:,}..{:,} out of {:,}".format(
                (ts() - ts_start) / 60, k1, k2, k_max))

    print("Elapsed %.0f minutes" % ((ts() - ts_start) / 60))
    print("Saving...")
    np.savez("dist.npz", dist=dist)
    print("DONE")

Just so you know, the scipy.cluster.hierarchy.linkage implementation isn't parallel and its complexity is at least O(N^2). I'm not sure whether scipy has a parallel implementation of this function.
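
For completeness, the 1-D array saved above can be passed straight to linkage (a short sketch; method='average' is just an example):

import numpy as np
from scipy.cluster.hierarchy import linkage

dist = np.load("dist.npz")["dist"]     # the condensed 1-D distance array saved above
Z = linkage(dist, method='average')    # single-threaded, at least O(N^2)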

answered by Bill