
Can't get scipy hierarchical clustering to work

I wrote a simple script to run hierarchical clustering on a simple test dataset: two well-separated Gaussian blobs of 100 points each, generated in the code below.

I found the function fclusterdata to be a candidate to cluster my data into two clusters. It takes two mandatory call parameters: the data set and a threshold. The problem is, I couldn't find a threshold that would yield the expected two clusters.

I'd be happy if anyone could tell me what I am doing wrong. I'd also be happy if anyone could point to other approaches that would be better suited to my clustering (I explicitly want to avoid specifying the number of clusters beforehand).

Here is my code:

import time
import scipy.cluster.hierarchy as hcluster
import numpy.random as random
import numpy

import pylab
pylab.ion()

data = random.randn(2,200)

# shift the first 100 points by +10 on both axes
data[:, :100] += 10

for i in range(5,15):
    thresh = i/10.
    clusters = hcluster.fclusterdata(numpy.transpose(data), thresh)
    pylab.scatter(*data[:,:], c=clusters)
    pylab.axis("equal")
    title = "threshold: %f, number of clusters: %d" % (thresh, len(set(clusters)))
    print(title)
    pylab.title(title)
    pylab.draw()
    time.sleep(0.5)
    pylab.clf()

Here is the output:

threshold: 0.500000, number of clusters: 129
threshold: 0.600000, number of clusters: 129
threshold: 0.700000, number of clusters: 129
threshold: 0.800000, number of clusters: 75
threshold: 0.900000, number of clusters: 75
threshold: 1.000000, number of clusters: 73
threshold: 1.100000, number of clusters: 58
threshold: 1.200000, number of clusters: 1
threshold: 1.300000, number of clusters: 1
threshold: 1.400000, number of clusters: 1

moooeeeep asked Mar 26 '12 14:03


1 Answer

Note that the function reference has an error. The correct definition of the t parameter is: "The cut-off threshold for the cluster function or the maximum number of clusters (criterion='maxclust')".

So try this:

clusters = hcluster.fclusterdata(numpy.transpose(data), 2, criterion='maxclust', metric='euclidean', depth=1, method='centroid')
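
For anyone who wants to try this without the plotting loop, here is a minimal, self-contained sketch. It rests on a few assumptions of mine that are not in the question: the seed, the variable names, and the cut-off value 7.0 are all arbitrary choices for this synthetic data. Part of the confusion is probably that fclusterdata defaults to criterion='inconsistent', so t is compared against inconsistency coefficients rather than distances. Since the question asks to avoid fixing the number of clusters, the sketch also tries criterion='distance', where t acts as a distance cut-off instead.

```python
import numpy as np
import scipy.cluster.hierarchy as hcluster

# Synthetic data in the spirit of the question: 200 points in 2-D,
# with the first 100 shifted by +10 on both axes.
# The seed is an arbitrary choice, used only for reproducibility.
rng = np.random.default_rng(0)
points = rng.standard_normal((200, 2))
points[:100] += 10

# criterion='maxclust': t is the maximum number of flat clusters.
labels_max = hcluster.fclusterdata(
    points, 2, criterion='maxclust', metric='euclidean', method='centroid')

# criterion='distance': t is a distance cut-off, so no cluster count is
# given up front. The value 7.0 is an assumption picked for this data
# (well above the spread of one blob, well below the gap between blobs).
labels_dist = hcluster.fclusterdata(
    points, 7.0, criterion='distance', metric='euclidean', method='centroid')

print("maxclust:", len(set(labels_max)), "clusters")
print("distance:", len(set(labels_dist)), "clusters")
```

The distance cut-off still has to be tuned to the scale of your data, so it trades "how many clusters" for "how far apart clusters must be". Whether that is easier to choose depends on the data set.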

Diego answered Oct 18 '22 09:10