Implementing an efficient graph data structure for maintaining cluster distances in the Rank-Order Clustering algorithm

I'm trying to implement the Rank-Order Clustering algorithm (a kind of agglomerative clustering) from scratch; here is a link to the paper. I have read through the paper (many times) and I have a working implementation, although it is a lot slower than I expect.

Here is a link to my GitHub repository, which has instructions for downloading and running the Jupyter Notebook.

The algorithm:

Algorithm 1 Rank-Order distance based clustering

Input:
  N faces, Rank-Order distance threshold t.
Output:
  A cluster set C and an “un-grouped” cluster Cun.
1: Initialize clusters C = { C1, C2, … CN }
 by letting each face be a single-element cluster.
2: repeat
3:  for all pairs Ci and Cj in C do
4:   Compute distances DR(Ci, Cj) by (4) and DN(Ci, Cj) by (5).
5:   if DR(Ci, Cj) < t and DN(Ci, Cj) < 1 then
6:    Denote ⟨Ci, Cj⟩ as a candidate merging pair.
7:   end if
8:  end for
9:  Do “transitive” merge on all candidate merging pairs.
  (For example, Ci, Cj, Ck are merged
  if ⟨Ci, Cj⟩ and ⟨Cj, Ck⟩ are candidate merging pairs.)
10:  Update C and absolute distances between clusters by (3).
11: until No merge happens
12: Move all single-element clusters in C into an “un-grouped” face cluster Cun.
13: return C and Cun.

My implementation:

I have defined a Cluster class like so:

class Cluster:
    def __init__(self):
        self.faces = list()
        self.absolute_distance_neighbours = None

A Face class like so:

class Face:
    def __init__(self, embedding):
        self.embedding = embedding # a point in 128 dimensional space
        self.absolute_distance_neighbours = None

I have also implemented finding the rank-order distance (D^R(C_i, C_j)) and the normalized distance (D^N(C_i, C_j)) used in step 4, so these can be taken for granted.

Here is my implementation for finding the closest absolute distance between two clusters:

import sys
import numpy as np

def find_nearest_distance_between_clusters(cluster1, cluster2):
    nearest_distance = sys.float_info.max
    for face1 in cluster1.faces:
        for face2 in cluster2.faces:
            distance = np.linalg.norm(face1.embedding - face2.embedding, ord = 1)

            if distance < nearest_distance: 
                nearest_distance = distance

            # If there is a distance of 0 then there is no need to continue
            if distance == 0:
                return(0)
    return(nearest_distance)


class Neighbour:
    def __init__(self, entity, distance):
        self.entity = entity # the neighbouring cluster
        self.distance = distance # closest absolute distance to that cluster


def assign_absolute_distance_neighbours_for_clusters(clusters, N = 20):
    for i, cluster1 in enumerate(clusters):
        nearest_neighbours = []
        for j, cluster2 in enumerate(clusters):
            distance = find_nearest_distance_between_clusters(cluster1, cluster2)    
            neighbour = Neighbour(cluster2, distance)
            nearest_neighbours.append(neighbour)
        nearest_neighbours.sort(key = lambda x: x.distance)
        # take only the top N neighbours
        cluster1.absolute_distance_neighbours = nearest_neighbours[0:N]

Here is my implementation of the rank-order clustering algorithm (assume that the implementation of find_normalized_distance_between_clusters and find_rank_order_distance_between_clusters is correct):

import networkx as nx
def find_clusters(faces):
    clusters = initial_cluster_creation(faces) # makes each face a cluster
    assign_absolute_distance_neighbours_for_clusters(clusters)
    t = 14 # threshold number for rank-order clustering
    prev_cluster_number = len(clusters)
    num_created_clusters = prev_cluster_number
    is_initialized = False

    while (not is_initialized) or (num_created_clusters):
        print("Number of clusters in this iteration: {}".format(len(clusters)))

        G = nx.Graph()
        for cluster in clusters:
            G.add_node(cluster)

        processed_pairs = 0

        # Find the candidate merging pairs
        for i, cluster1 in enumerate(clusters):

            # Only get the top 20 nearest neighbours for each cluster
            for j, cluster2 in enumerate([neighbour.entity for neighbour in \
                                          cluster1.absolute_distance_neighbours]):
                processed_pairs += 1
                print("Processed {}/{} pairs".format(processed_pairs, len(clusters) * 20), end="\r")
                # No need to merge with yourself 
                if cluster1 is cluster2:
                    continue
                else: 
                    normalized_distance = find_normalized_distance_between_clusters(cluster1, cluster2)
                    if (normalized_distance >= 1):
                        continue
                    rank_order_distance = find_rank_order_distance_between_clusters(cluster1, cluster2)
                    if (rank_order_distance >= t):
                        continue
                    G.add_edge(cluster1, cluster2) # add an edge to denote that these two clusters are to be merged

        # Create the new clusters            
        clusters = []
        # Note here that nx.connected_components(G) are 
        # the clusters that are connected
        for _clusters in nx.connected_components(G):
            new_cluster = Cluster()
            for cluster in _clusters:
                for face in cluster.faces:
                    new_cluster.faces.append(face)
            clusters.append(new_cluster)


        current_cluster_number = len(clusters)
        num_created_clusters = prev_cluster_number - current_cluster_number
        prev_cluster_number = current_cluster_number


        # Recalculate the distance between clusters (this is what is taking a long time)
        assign_absolute_distance_neighbours_for_clusters(clusters)


        is_initialized = True

    # Now that the clusters have been created, separate them into clusters that have one face
    # and clusters that have more than one face
    unmatched_clusters = []
    matched_clusters = []

    for cluster in clusters:
        if len(cluster.faces) == 1:
            unmatched_clusters.append(cluster)
        else:
            matched_clusters.append(cluster)

    matched_clusters.sort(key = lambda x: len(x.faces), reverse = True)

    return(matched_clusters, unmatched_clusters)

The problem:

The reason for the slow performance is step 10: Update C and absolute distances between clusters by (3), where (3) is:

D(C_i, C_j) = min_{a ∈ C_i, b ∈ C_j} ||a − b||_1    (3)

This is the smallest L1-norm distance between any face in C_i (cluster i) and any face in C_j (cluster j).

After merging the clusters, I have to recalculate the absolute distances between the newly created clusters every time I finish finding the candidate merging pairs in steps 3 - 8. That means a nested for loop over every pair of created clusters, and inside it ANOTHER nested for loop over their faces to find the two faces with the closest distance, which amounts to roughly (number of faces)^2 L1-distance computations per pass. Afterwards, I still have to sort each cluster's neighbours by nearest distance!

I believe that this is the wrong approach, as I have looked at OpenBR, which has also implemented the same Rank-Order Clustering algorithm that I want; it is under the method name:

Clusters br::ClusterGraph(Neighborhood neighborhood, float aggressiveness, const QString &csv)

Although I'm not that familiar with C++, I'm pretty sure that they are not recalculating the absolute distances between the clusters, which leads me to believe that this is the part of the algorithm that I am implementing incorrectly.

Moreover, at the top of their method declaration the comments say that they have pre-computed a kNN graph, which makes sense: when I recalculate the absolute distances between clusters I am redoing a lot of computation that I have previously done. I believe that the key is to precompute a kNN graph for the clusters, although this is the part that I'm stuck at. I can't think of how to implement the data structure so that the absolute distances between clusters would not have to be recalculated every time they are merged.

asked Apr 18 '17 by YellowPillow


2 Answers

At a high level, and this is what OpenBR seems to do as well, what is needed is a lookup table from cluster ID -> cluster object, from which a new cluster list is generated without re-calculation.

You can see where the new cluster is generated from an ID lookup table at this section of OpenBR.

For your code, you will need to add an ID to each Cluster object; integers will probably be best for memory usage. Then update the merge code in find_clusters to build a list of to-be-merged indices and create the new cluster list from the existing cluster indices. Then merge and update the neighbours from their indices.

The last step, neighbour index merging, can be seen here on OpenBR.

The key part is that no new clusters are created on merge and distances are not re-calculated for them. Only indices are updated from the existing cluster objects, and their neighbour distances are merged.
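Here is a rough Python sketch of that idea, under a few assumptions that go beyond the question: every Cluster carries an integer id, lookup is a dict mapping cluster id -> Cluster, and Neighbour is the same (entity, distance) pair already used in the question. The name merge_component is illustrative and is not OpenBR's API.

def merge_component(component_ids, lookup, N = 20):
    # Merge the clusters whose ids form one connected component into a single
    # surviving cluster, reusing the neighbour distances that were already
    # computed instead of re-scanning faces.
    component_ids = set(component_ids)
    keep_id = min(component_ids)          # the surviving cluster; the others are absorbed
    survivor = lookup[keep_id]
    best = {}                             # neighbour cluster id -> smallest known distance

    for cid in component_ids:
        cluster = lookup[cid]
        if cid != keep_id:
            survivor.faces.extend(cluster.faces)
        for neighbour in cluster.absolute_distance_neighbours:
            nid = neighbour.entity.id
            if nid in component_ids:      # this neighbour is being absorbed by the same merge
                continue
            if nid not in best or neighbour.distance < best[nid]:
                best[nid] = neighbour.distance

    # d(C_merged, C_k) = min over the merged clusters of d(C_i, C_k), so the merged
    # neighbour list is just the element-wise minimum of the old lists, re-sorted.
    survivor.absolute_distance_neighbours = sorted(
        (Neighbour(lookup[nid], d) for nid, d in best.items()),
        key = lambda n: n.distance)[0:N]

    # Point the absorbed ids at the survivor so later lookups by id stay valid.
    for cid in component_ids:
        lookup[cid] = survivor
    return(survivor)

Because each cluster only keeps its top N absolute-distance neighbours, the merged list is an approximation: a cluster that was outside every constituent's top N can never re-enter it. That is the trade-off a pre-computed kNN graph makes, and it is what lets the merge step avoid touching the raw face distances again.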

answered Nov 02 '22 by danny


You could try to store the distance values between faces in a dictionary, e.g.:

class Face:
    def __init__(self, embedding, id):
        self.embedding = embedding # a point in 128 dimensional space
        self.absolute_distance_neighbours = None
        self.id = id #Add face unique id

distances = {}

def find_nearest_distance_between_clusters(cluster1, cluster2):
    nearest_distance = sys.float_info.max
    for face1 in cluster1.faces:
        for face2 in cluster2.faces:
            # order the ids so that (face1, face2) and (face2, face1) share one cache entry
            key = (face1.id, face2.id) if face1.id < face2.id else (face2.id, face1.id)
            if key not in distances:
                distances[key] = np.linalg.norm(face1.embedding - face2.embedding, ord = 1) # calc distance only once
            distance = distances[key] # use the precalculated distance
            if distance < nearest_distance: 
                nearest_distance = distance

            # If there is a distance of 0 then there is no need to continue
            if distance == 0:
                return(0)
    return(nearest_distance)
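
This works because the face embeddings never change: a distance computed once between two faces stays valid no matter how the clusters around them are merged, so after the first pass step 10 becomes mostly dictionary lookups instead of fresh L1-norm computations. The faces just need unique ids when they are created, for example (a minimal sketch; the embeddings list is an assumed name, not something from the question):

# give every face a unique integer id so that face pairs can serve as cache keys
faces = [Face(embedding, face_id) for face_id, embedding in enumerate(embeddings)]
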
answered Nov 02 '22 by Logman