I have a similarity matrix between four users, and I want to run agglomerative clustering on it. My code looks like this:
import time
import numpy as np
from sklearn.cluster import AgglomerativeClustering
lena = np.matrix('1 1 0 0;1 1 0 0;0 0 1 0.2;0 0 0.2 1')
X = np.reshape(lena, (-1, 1))
print("Compute structured hierarchical clustering...")
st = time.time()
n_clusters = 3 # number of clusters
ward = AgglomerativeClustering(n_clusters=n_clusters,
linkage='complete').fit(X)
print(ward)
label = np.reshape(ward.labels_, lena.shape)
print("Elapsed time: ", time.time() - st)
print("Number of pixels: ", label.size)
print("Number of clusters: ", np.unique(label).size)
print(label)
The printed value of label is:
[[1 1 0 0]
[1 1 0 0]
[0 0 1 2]
[0 0 2 1]]
Does this mean it returns a list of possible clustering results from which we can choose one, for example [0, 0, 2, 1]? If that is wrong, could you tell me how to run the agglomerative algorithm based on similarity? If it is right, the similarity matrix is huge, so how can I choose the optimal clustering result from such a huge list? Thanks.
One drawback is that groups with close pairs can merge sooner than is optimal, even if those groups have overall dissimilarity. Complete Linkage: calculates similarity of the farthest away pair. One disadvantage to this method is that outliers can cause less-than-optimal merging.
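The linkage remarks above can be made concrete with a small numeric sketch. The two groups of points are invented for illustration, and it is an assumption that the first drawback refers to single linkage (the nearest pair), which the text does not name.

```python
import numpy as np

# Hypothetical groups of 2-D points, invented for illustration.
group_a = np.array([[0.0, 0.0], [1.0, 0.0]])
group_b = np.array([[3.0, 0.0], [5.0, 0.0]])

# Euclidean distance between every point in A and every point in B.
diffs = group_a[:, None, :] - group_b[None, :, :]
pairwise = np.sqrt((diffs ** 2).sum(axis=-1))

# Single linkage scores the two groups by their closest pair,
# complete linkage by their farthest pair.
single_linkage = pairwise.min()
complete_linkage = pairwise.max()
print(single_linkage, complete_linkage)
```

Because complete linkage looks only at the farthest pair, moving a single point of group_b far away would change its value, which is how an outlier can affect the merge order.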
Agglomerative clustering but for features instead of samples. Hierarchical clustering with ward linkage. Fit the hierarchical clustering from features, or distance matrix. Fit and return the result of each sample's clustering assignment.
I think the problem here is that you fit your model with the wrong data:
# This will return a 4x4 matrix (similarity matrix)
lena = np.matrix('1 1 0 0;1 1 0 0;0 0 1 0.2;0 0 0.2 1')
# However, this returns a 16x1 matrix (16 samples with 1 feature each)
X = np.reshape(lena, (-1, 1))
The result you actually get is this:
ward.labels_
>> array([1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 2, 0, 0, 2, 1])
This is the label of each element of the X vector, not of each user, so it does not make sense.
If I understood your problem correctly, you need to group your users by the similarity between them. In that case I would suggest using spectral clustering, like this:
import numpy as np
from sklearn.cluster import SpectralClustering
lena = np.matrix('1 1 0 0;1 1 0 0;0 0 1 0.2;0 0 0.2 1')
n_clusters = 3
SpectralClustering(n_clusters).fit_predict(lena)
>> array([1, 1, 0, 2], dtype=int32)