
sklearn agglomerative clustering input data

I have a similarity matrix between four users and want to run agglomerative clustering on it. My code looks like this:

import time

import numpy as np
from sklearn.cluster import AgglomerativeClustering

lena = np.array([[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 0.2], [0, 0, 0.2, 1]])
X = np.reshape(lena, (-1, 1))

print("Compute structured hierarchical clustering...")
st = time.time()
n_clusters = 3  # number of regions

ward = AgglomerativeClustering(n_clusters=n_clusters,
        linkage='complete').fit(X)
print(ward)
label = np.reshape(ward.labels_, lena.shape)
print("Elapsed time: ", time.time() - st)
print("Number of pixels: ", label.size)
print("Number of clusters: ", np.unique(label).size)
print(label)

The printed label is:

[[1 1 0 0]
 [1 1 0 0]
 [0 0 1 2]
 [0 0 2 1]]

Does this mean it returns a list of possible clustering results from which we can choose one, for example [0,0,2,1]? If that is wrong, could you tell me how to run the agglomerative algorithm on a similarity matrix? If it is right, my real similarity matrix is huge, so how can I choose the optimal clustering result from such a large list? Thanks.

asked Oct 08 '15 18:10 by printemp



1 Answer

I think the problem is that you are fitting the model on the wrong data:

# This creates a 4x4 array (the similarity matrix)
lena = np.array([[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 0.2], [0, 0, 0.2, 1]])

# However, this reshapes it into a 16x1 array
X = np.reshape(lena, (-1, 1))

The result you actually get is:

 ward.labels_
 >> array([1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 2, 0, 0, 2, 1])

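This is one label for each element of the X vector, which doesn't make sense: the 16 matrix entries are clustered as if each were a separate sample, so you do not get one label per user.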
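To make the shape issue concrete, here is a minimal sketch using only NumPy. The variable name `S` is mine, not from the question. It shows that the reshape turns the 4 users into 16 one-dimensional samples, which is why 16 labels come back.

```python
import numpy as np

# The 4x4 similarity matrix from the question (one row/column per user)
S = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 0.2],
              [0, 0, 0.2, 1]])

# Flattening to a single column gives 16 rows; scikit-learn treats each row
# of its input as one sample, so this is 16 samples with 1 feature each.
X = np.reshape(S, (-1, 1))

print(S.shape)  # (4, 4)
print(X.shape)  # (16, 1)
```

Reshaping the 16 labels back to 4x4 therefore only shows which matrix entries have similar values; it is not a list of alternative clusterings.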

If I understood your problem correctly, you need to group your users by the similarity between them. In that case I suggest using spectral clustering, like this:

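import numpy as np
from sklearn.cluster import SpectralClustering

lena = np.array([[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 0.2], [0, 0, 0.2, 1]])

n_clusters = 3
# Note: with the default affinity='rbf' the rows of lena are treated as feature
# vectors; pass affinity='precomputed' to use lena directly as a similarity matrix.
SpectralClustering(n_clusters=n_clusters).fit_predict(lena)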

>> array([1, 1, 0, 2], dtype=int32)
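If you would rather stay with agglomerative clustering, one possible approach is to convert the similarities to distances and pass them in as a precomputed matrix. The sketch below is only that, a sketch: it assumes the similarities lie in [0, 1] so that `1 - S` is a reasonable dissimilarity, and the names `S`, `D` and `model` are mine. The keyword for a precomputed matrix is `metric` in recent scikit-learn releases and `affinity` in older ones, so the code tries both.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Similarity matrix from the question (one row/column per user)
S = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 0.2],
              [0, 0, 0.2, 1]])

# Assumption: similarities are in [0, 1], so 1 - S can serve as a distance
D = 1.0 - S

try:
    # Newer scikit-learn releases name the parameter `metric`
    model = AgglomerativeClustering(n_clusters=3, metric="precomputed",
                                    linkage="complete")
except TypeError:
    # Older releases name it `affinity`
    model = AgglomerativeClustering(n_clusters=3, affinity="precomputed",
                                    linkage="complete")

# One label per user (4 labels), not one per matrix entry
labels = model.fit_predict(D)
print(labels)
```

Here the first two users end up in the same cluster and the other two each get their own. The label numbers themselves are arbitrary. Note that `linkage='ward'` cannot be combined with a precomputed matrix, which is why `complete` is kept from the question.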
answered Oct 01 '22 13:10 by farhawa