Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to get centroids from SciPy's hierarchical agglomerative clustering?

I am using SciPy's hierarchical agglomerative clustering methods to cluster a m x n matrix of features, but after the clustering is complete, I can't seem to figure out how to get the centroid from the resulting clusters. Below follows my code:

Y = distance.pdist(features)
Z = hierarchy.linkage(Y, method = "average", metric = "euclidean")
T = hierarchy.fcluster(Z, 100, criterion = "maxclust")

I am taking my matrix of features, computing the euclidean distance between them, and then passing them onto the hierarchical clustering method. From there, I am creating flat clusters, with a maximum of 100 clusters

Now, based on the flat clusters T, how do I get the 1 x n centroid that represents each flat cluster?

like image 542
Adrian Rosebrock Avatar asked Feb 20 '12 13:02

Adrian Rosebrock


People also ask

How do you find the centroid in hierarchical clustering?

To calculate the centroid from the cluster table just get the position of all points of a single cluster, sum them up and divide by the number of points.

How do you derive clusters from dendrogram?

Observations are allocated to clusters by drawing a horizontal line through the dendrogram. Observations that are joined together below the line are in clusters. In the example below, we have two clusters. One cluster combines A and B, and a second cluster combining C, D, E, and F.

How are objects clustered in agglomerative hierarchical clustering?

Agglomerative clustering works in a “bottom-up” manner. That is, each object is initially considered as a single-element cluster (leaf). At each step of the algorithm, the two clusters that are the most similar are combined into a new bigger cluster (nodes).


2 Answers

A possible solution is a function, which returns a codebook with the centroids like kmeans in scipy.cluster.vq does. Only thing you need is the partition as vector with flat clusters part and the original observations X

def to_codebook(X, part):
    """
    Calculates centroids according to flat cluster assignment

    Parameters
    ----------
    X : array, (n, d)
        The n original observations with d features

    part : array, (n)
        Partition vector. p[n]=c is the cluster assigned to observation n

    Returns
    -------
    codebook : array, (k, d)
        Returns a k x d codebook with k centroids
    """
    codebook = []

    for i in range(part.min(), part.max()+1):
        codebook.append(X[part == i].mean(0))

    return np.vstack(codebook)
like image 169
embert Avatar answered Sep 25 '22 10:09

embert


You can do something like this (D=number of dimensions):

# Sum the vectors in each cluster
lens = {}      # will contain the lengths for each cluster
centroids = {} # will contain the centroids of each cluster
for idx,clno in enumerate(T):
    centroids.setdefault(clno,np.zeros(D)) 
    centroids[clno] += features[idx,:]
    lens.setdefault(clno,0)
    lens[clno] += 1
# Divide by number of observations in each cluster to get the centroid
for clno in centroids:
    centroids[clno] /= float(lens[clno])

This will give you a dictionary with cluster number as the key and the centroid of the specific cluster as the value.

like image 33
dkar Avatar answered Sep 24 '22 10:09

dkar