What is the best way to test a clustering algorithm? I am using an agglomerative clustering algorithm with a stop criterion. How do I test if the clusters are formed correctly or not?
A good rule of thumb for evaluating how much a graph can be clustered (on a coarse grained level) has to do with the "eigenvalue gap". Given a weighted graph A
, calculate the eigenvalues and sort them (this is the eigenvalue spectrum). When plotted, if there is a large jump in the spectrum at some point, there is a natural corresponding block to partition the graph.
Below is an example (in numpy python) that shows, given an almost block diagonal matrix there a large gap in the eigenvalue spectrum at the number of blocks (parameterized by c
in the code). Note that a matrix permutation (identical to labeling your graph nodes) still gives the same spectral gap:
from numpy import *
import pylab as plt
# Make a block diagonal matrix
N = 30
c = 5
A = zeros((N*c,N*c))
for m in xrange(c):
A[m*N:(m+1)*N, m*N:(m+1)*N] = random.random((N,N))
# Add some noise
A += random.random(A.shape) * 0.1
# Make symmetric
A += A.T - diag(A.diagonal())
# Show the original matrix
plt.subplot(131)
plt.imshow(A.copy(), interpolation='nearest')
# Permute the matrix for effect
idx = random.permutation(N*c)
A = A[idx,:][:,idx]
# Compute eigenvalues
L = linalg.eigvalsh(A)
# Show the results
plt.subplot(132)
plt.imshow(A, interpolation='nearest')
plt.subplot(133)
plt.plot(sorted(L,reverse=True))
plt.plot([c-.5,c-.5],[0,max(L)],'r--')
plt.ylim(0,max(L))
plt.xlim(0,20)
plt.show()
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With