I wrote a simple script that is intended to do hierarchical clustering on a simple test dataset.
I found the function fclusterdata to be a candidate to cluster my data into two clusters. It takes two mandatory call parameters: the data set and a threshold. The problem is, I couldn't find a threshold that would yield the expected two clusters.
I'd be happy if anyone can tell me what I am doing wrong. I'd also be happy if anyone could point on other approaches that would be better suited for my clustering (I explicitly want to avoid to specify the number of clusters beforehand.)
Here is my code:
import time
import scipy.cluster.hierarchy as hcluster
import numpy.random as random
import numpy
import pylab
pylab.ion()
data = random.randn(2,200)
data[:100,:100] += 10
for i in range(5,15):
thresh = i/10.
clusters = hcluster.fclusterdata(numpy.transpose(data), thresh)
pylab.scatter(*data[:,:], c=clusters)
pylab.axis("equal")
title = "threshold: %f, number of clusters: %d" % (thresh, len(set(clusters)))
print title
pylab.title(title)
pylab.draw()
time.sleep(0.5)
pylab.clf()
Here is the output:
threshold: 0.500000, number of clusters: 129
threshold: 0.600000, number of clusters: 129
threshold: 0.700000, number of clusters: 129
threshold: 0.800000, number of clusters: 75
threshold: 0.900000, number of clusters: 75
threshold: 1.000000, number of clusters: 73
threshold: 1.100000, number of clusters: 58
threshold: 1.200000, number of clusters: 1
threshold: 1.300000, number of clusters: 1
threshold: 1.400000, number of clusters: 1
Note that the function reference has an error. The correct definition of the t
parameter is: "The cut-off threshold for the cluster function or the maximum number of clusters (criterion=’maxclust’)".
So try this:
clusters = hcluster.fclusterdata(numpy.transpose(data), 2, criterion='maxclust', metric='euclidean', depth=1, method='centroid')
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With