I have a simple 2-dimensional dataset that I wish to cluster in an agglomerative manner (not knowing the optimal number of clusters to use). The only way I've been able to cluster my data successfully is by giving the function a 'maxclust' value.
For simplicity's sake, let's say this is my dataset:
X=[ 1,1;
1,2;
2,2;
2,1;
5,4;
5,5;
6,5;
6,4 ];
Naturally, I would want this data to form 2 clusters. I understand that if I knew this, I could just say:
T = clusterdata(X,'maxclust',2);
and to find which points fall into each cluster I could say:
cluster_1 = X(T==1, :);
and
cluster_2 = X(T==2, :);
but without knowing that 2 clusters would be optimal for this dataset, how do I cluster these data?
Thanks
The clustering process can be terminated when the minimum distance between the nearest clusters exceeds a chosen threshold. For example, in a dendrogram over six points a to f, cutting the tree after the second row yields the clusters {a} {b c} {d e} {f}.
Agglomerative Clustering is a type of hierarchical clustering algorithm. It is an unsupervised machine learning technique that divides the population into several clusters such that data points in the same cluster are more similar and data points in different clusters are dissimilar.
Agglomerative Clustering is a bottom-up strategy in which each data point starts as a cluster of its own, and pairs of clusters are merged as one moves up the hierarchy. At each step, the two nearest clusters are joined to form a single cluster.
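To make the bottom-up merging concrete, here is a minimal sketch that builds the hierarchy for the dataset from the question and draws it. It assumes the Statistics and Machine Learning Toolbox is available and uses single linkage with Euclidean distance, which is a choice made for this illustration rather than something the question specifies.

```matlab
% Dataset from the question: two well-separated unit squares
X = [1,1; 1,2; 2,2; 2,1; 5,4; 5,5; 6,5; 6,4];

% Build the agglomerative hierarchy (single linkage is assumed here).
% Each row of Z records one merge: the two clusters joined and the
% distance (height) at which they were joined.
Z = linkage(X, 'single');

% With 8 points there are 7 merges; the last one joins the two squares.
disp(Z)

% Draw the tree; the heights on the vertical axis are the merge distances.
dendrogram(Z)
```

The last row of Z has a much larger height than the others, which is the visual gap that suggests two clusters.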
The whole point of this method is that it represents the clusters found as a hierarchy, and it is up to you to decide how much detail you want.
Think of this as having a horizontal line intersecting the dendrogram, which moves from 0 (each point is its own cluster) all the way to the maximum value (all points in one cluster). You could stop when the number of clusters reaches a specific value, or stop when the height reaches a certain value. This can be done by using either the 'maxclust' or the 'cutoff' argument of the CLUSTER/CLUSTERDATA functions.
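As a sketch of the height-based option, the snippet below cuts the tree at a fixed distance instead of fixing the number of clusters. The threshold of 1.5 and the single linkage are assumptions chosen for this dataset; note that 'cutoff' is interpreted as an inconsistency coefficient by default, so 'criterion','distance' is passed to make it act as a height.

```matlab
% Dataset from the question
X = [1,1; 1,2; 2,2; 2,1; 5,4; 5,5; 6,5; 6,4];

% Build the hierarchy (single linkage assumed for illustration)
Z = linkage(X, 'single');

% Cut the tree where the merge distance reaches 1.5 (an assumed threshold).
% 'criterion','distance' makes the cutoff a height in the dendrogram.
T = cluster(Z, 'cutoff', 1.5, 'criterion', 'distance');

% The number of clusters now falls out of the data instead of being given
numClusters = max(T);
```

Within each square the points are 1 apart, while the squares are about 3.6 apart, so any threshold between those two values separates them.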
To choose the optimal number of clusters, one common approach is to make a plot similar to a Scree Plot. Then you look for the "elbow" in the plot, and that is the number of clusters you pick. For the criterion here, we will use the within-cluster sum-of-squares:
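function wss = plotScree(X, n)
% Plot the within-cluster sum-of-squares for 1..n clusters
wss = zeros(1, n);
% One cluster: total sum of squares around the overall mean
wss(1) = (size(X, 1)-1) * sum(var(X, [], 1));
for i=2:n
% Cut the hierarchy into i clusters
T = clusterdata(X,'maxclust',i);
% Per cluster: (size-1) * summed variance, then add up over clusters
wss(i) = sum((grpstats(T, T, 'numel')-1) .* sum(grpstats(X, T, 'var'), 2));
end
hold on
plot(wss)
plot(wss, '.')
xlabel('Number of clusters')
ylabel('Within-cluster sum-of-squares')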
>> plotScree(X, 5)
ans =
54.0000 4.0000 3.3333 2.5000 2.0000
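If you would rather not read the elbow off the plot by eye, one rough heuristic is to pick the point where the curve bends most sharply, measured by the second difference. This is only a sketch of one possible rule, not a standard MATLAB function, and it assumes the wss values printed above.

```matlab
% Within-cluster sum-of-squares for k = 1..5, copied from the output above
wss = [54.0000 4.0000 3.3333 2.5000 2.0000];

% Second difference of the curve; entry j measures the bend at k = j+1
curvature = diff(wss, 2);

% Take the k with the sharpest bend as the elbow (heuristic assumption)
[~, idx] = max(curvature);
k = idx + 1;
```

On curves without a clear bend this rule can be unstable, so it is worth checking the result against the plot.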