Choosing the number of clusters in hierarchical agglomerative clustering with scikit

The Wikipedia article on determining the number of clusters in a dataset suggests that I do not need to worry about this problem when using hierarchical clustering. However, when I tried to use scikit-learn's agglomerative clustering, I saw that I have to pass the number of clusters as the parameter "n_clusters" - otherwise I get the hardcoded default of two clusters. How can I go about choosing the right number of clusters for my dataset in this case? Is the wiki article wrong?

asked Aug 26 '15 by DaTaBomB


People also ask

How can we decide the number of clusters in hierarchical agglomerative clustering?

Cut the dendrogram with a horizontal line at the chosen distance threshold: the number of clusters is the number of vertical lines that this threshold line intersects. For example, if the threshold line intersects 2 vertical lines, we will have 2 clusters.
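
As a minimal sketch (not part of the original page; the toy data and the threshold value 5.0 are assumptions for illustration), this is how a dendrogram can be cut at a chosen distance threshold with scipy:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.randn(20, 2)                           # toy data, purely illustrative
Z = linkage(X, method='ward')                        # build the merge tree (dendrogram)
labels = fcluster(Z, t=5.0, criterion='distance')    # cut at distance 5.0 (arbitrary threshold)
print(len(set(labels)), "clusters at this threshold")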

What's a good way to choose the number of clusters to use?

The “Elbow” method is probably the best-known approach: the within-cluster sum of squares is computed and plotted for each candidate number of clusters, and the user looks for the point where the slope changes from steep to shallow (an elbow) to determine the optimal number of clusters.
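
A rough sketch of that idea (the toy data and the range of k are assumptions for illustration), recording KMeans inertia for each candidate k:

import numpy as np
from sklearn.cluster import KMeans

X = np.random.randn(200, 2)                          # toy data, purely illustrative
inertias = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)                     # within-cluster sum of squares
# plot k against inertias and look for the point where the curve flattens (the elbow)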

How do you decide the number of clusters needed for the given dataset?

A simple method to calculate the number of clusters is to set the value to about √(n/2) for a dataset of 'n' points.
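
For illustration (the value of n here is an assumption), with n = 200 points this rule of thumb gives k ≈ 10:

import math
n = 200                         # assumed dataset size
k = round(math.sqrt(n / 2))     # sqrt(200 / 2) = 10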


1 Answer

Wikipedia is simply making an extreme simplification which has little to do with real life. Hierarchical clustering does not avoid the problem of choosing the number of clusters. It simply constructs a tree spanning all samples, which shows which samples (and, later on, clusters) merge together to create a bigger cluster. This happens recursively until you have just two clusters (this is why the default number of clusters is 2), which are finally merged into the whole dataset. You are then left to "cut" through the tree yourself to get an actual clustering. Once you fit AgglomerativeClustering you can traverse the whole tree and analyze which clusters to keep:

import numpy as np
from sklearn.cluster import AgglomerativeClustering
import itertools

# 5 points in 10 dimensions: 3 near the origin and 2 far away
X = np.concatenate([np.random.randn(3, 10), np.random.randn(2, 10) + 100])
clustering = AgglomerativeClustering()
clustering.fit(X)

# children_ lists the pairs merged at each step; leaves are numbered 0..n_samples-1,
# so internal node ids start at n_samples
ii = itertools.count(X.shape[0])
[{'node_id': next(ii), 'left': x[0], 'right': x[1]} for x in clustering.children_]
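
As a follow-up sketch (not from the original answer; the threshold value 50.0 is an arbitrary assumption), the same data can be cut at a distance threshold instead of a fixed number of clusters, for example with scipy:

from scipy.cluster.hierarchy import linkage, fcluster

Z = linkage(X, method='ward')                         # merge tree over the same 5 points
labels = fcluster(Z, t=50.0, criterion='distance')    # keep merges below distance 50.0
print(labels)                                         # for this data: two well-separated clusters

# newer scikit-learn releases (>= 0.21) expose the same idea directly:
# AgglomerativeClustering(n_clusters=None, distance_threshold=50.0).fit(X)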
answered Sep 20 '22 by lejlot