I am using SciPy's hierarchical agglomerative clustering methods to cluster a m x n matrix of features, but after the clustering is complete, I can't seem to figure out how to get the centroid from the resulting clusters. Below follows my code:
from scipy.cluster import hierarchy
from scipy.spatial import distance

Y = distance.pdist(features)                                    # condensed pairwise Euclidean distances
Z = hierarchy.linkage(Y, method="average", metric="euclidean")  # average-linkage agglomerative clustering
T = hierarchy.fcluster(Z, 100, criterion="maxclust")            # flat cluster labels, at most 100 clusters
I am taking my matrix of features, computing the Euclidean distances between the observations, and then passing the result on to the hierarchical clustering method. From there, I am creating flat clusters, with a maximum of 100 clusters.
Now, based on the flat clusters T, how do I get the 1 x n centroid that represents each flat cluster?
To calculate the centroid of a cluster, take all observations assigned to that cluster, sum them, and divide by the number of observations, i.e. take their mean.
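For example, with the features matrix and the flat cluster labels T from the question, the centroid of (hypothetical) cluster 1 is just a mean over a boolean mask:

# centroid of cluster 1: mean of all rows of `features` whose label in T is 1
centroid_1 = features[T == 1].mean(axis=0)   # 1 x n vector, one value per feature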
Observations are allocated to clusters by drawing a horizontal line through the dendrogram; observations that are joined below the line belong to the same cluster. For example, a cut might give two clusters: one combining A and B, and a second combining C, D, E, and F.
Agglomerative clustering works in a "bottom-up" manner: each object starts as a single-element cluster (a leaf), and at each step of the algorithm the two most similar clusters are merged into a new, bigger cluster (a node).
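If you want the horizontal-line cut described above rather than a fixed number of clusters, fcluster also accepts a distance threshold. A minimal sketch, assuming the linkage matrix Z from the question and a hypothetical cut height:

# cut the dendrogram with a horizontal line at a chosen height
cut_height = 0.7 * Z[:, 2].max()   # hypothetical threshold; column 2 of Z holds the merge distances
T_cut = hierarchy.fcluster(Z, cut_height, criterion="distance")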
A possible solution is a function that returns a codebook with the centroids, like kmeans in scipy.cluster.vq does. The only things you need are the partition vector with the flat cluster assignments (part) and the original observations X:
import numpy as np

def to_codebook(X, part):
    """
    Calculates centroids according to flat cluster assignment

    Parameters
    ----------
    X : array, (n, d)
        The n original observations with d features
    part : array, (n)
        Partition vector. part[n] = c is the cluster assigned to observation n

    Returns
    -------
    codebook : array, (k, d)
        A k x d codebook with k centroids
    """
    codebook = []
    for i in range(part.min(), part.max() + 1):
        codebook.append(X[part == i].mean(0))  # mean of all observations in cluster i
    return np.vstack(codebook)
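Using it with the features matrix and the flat cluster labels T from the question:

codebook = to_codebook(features, T)   # k x d array, one centroid per flat cluster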
You can do something like this (D = number of dimensions):
import numpy as np

# Sum the vectors in each cluster
lens = {}       # will contain the number of observations in each cluster
centroids = {}  # will contain the centroid of each cluster
for idx, clno in enumerate(T):
    centroids.setdefault(clno, np.zeros(D))
    centroids[clno] += features[idx, :]
    lens.setdefault(clno, 0)
    lens[clno] += 1

# Divide by the number of observations in each cluster to get the centroid
for clno in centroids:
    centroids[clno] /= float(lens[clno])
This will give you a dictionary with cluster number as the key and the centroid of the specific cluster as the value.
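If you prefer a single k x D array instead of a dictionary, you can stack the values in order of cluster number, for example:

codebook = np.vstack([centroids[clno] for clno in sorted(centroids)])   # row i corresponds to the i-th smallest cluster label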