sklearn agglomerative clustering input data

I have a similarity matrix between four users and I want to run agglomerative clustering on it. The code looks like this:

import time

import numpy as np
from sklearn.cluster import AgglomerativeClustering

# 4x4 similarity matrix between the four users
lena = np.array([[1, 1, 0, 0],
                 [1, 1, 0, 0],
                 [0, 0, 1, 0.2],
                 [0, 0, 0.2, 1]])
X = np.reshape(lena, (-1, 1))

print("Compute structured hierarchical clustering...")
st = time.time()
n_clusters = 3  # number of regions

ward = AgglomerativeClustering(n_clusters=n_clusters,
                               linkage='complete').fit(X)
print(ward)
label = np.reshape(ward.labels_, lena.shape)
print("Elapsed time: ", time.time() - st)
print("Number of pixels: ", label.size)
print("Number of clusters: ", np.unique(label).size)
print(label)

The printed label looks like this:

[[1 1 0 0]
 [1 1 0 0]
 [0 0 1 2]
 [0 0 2 1]]

Does this mean it gives a list of possible clustering results and I can choose one of them, for example [0, 0, 2, 1]? If that is wrong, could you tell me how to run the agglomerative algorithm based on similarity? If it is right, the similarity matrix is huge, so how can I choose the optimal clustering result from a huge list? Thanks.

asked Oct 08 '15 18:10

printemp

1 Answer

I think the problem here is that you fit your model with the wrong data:

# This creates a 4x4 array (the similarity matrix)
lena = np.array([[1, 1, 0, 0],
                 [1, 1, 0, 0],
                 [0, 0, 1, 0.2],
                 [0, 0, 0.2, 1]])

# However, this flattens it into a 16x1 array
X = np.reshape(lena, (-1, 1))

The actual result you get is this:

 ward.labels_
 >> array([1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 2, 0, 0, 2, 1])

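These are the labels of each element of the X vector: every one of the 16 matrix entries is treated as a separate sample with a single feature, so the result does not make sense as a clustering of the four users.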
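A small check with NumPy alone makes this visible. It is only an illustrative sketch (the variable name `similarity` is mine, not from the question): after the reshape there are 16 samples, and they take just three distinct values, which is why asking for 3 clusters simply groups the matrix entries by value.

```python
import numpy as np

# Same 4x4 similarity matrix as in the question
similarity = np.array([[1, 1, 0, 0],
                       [1, 1, 0, 0],
                       [0, 0, 1, 0.2],
                       [0, 0, 0.2, 1]])

# The reshape from the question: one column, one row per matrix entry
X = np.reshape(similarity, (-1, 1))

print(similarity.shape)       # (4, 4): one row per user
print(X.shape)                # (16, 1): sixteen one-feature samples
print(np.unique(X).tolist())  # [0.0, 0.2, 1.0]: only three distinct values
```

Comparing this with ward.labels_ above, entries equal to 1 got label 1, entries equal to 0 got label 0, and entries equal to 0.2 got label 2.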

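If I understood your problem correctly, you need to cluster your users by the similarity between them. In that case I would suggest using spectral clustering, like this (note that with the default settings each row of the matrix is treated as a feature vector; passing affinity='precomputed' makes SpectralClustering use the matrix directly as similarities):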

import numpy as np
from sklearn.cluster import SpectralClustering

# 4x4 similarity matrix between the four users
lena = np.array([[1, 1, 0, 0],
                 [1, 1, 0, 0],
                 [0, 0, 1, 0.2],
                 [0, 0, 0.2, 1]])

n_clusters = 3
SpectralClustering(n_clusters=n_clusters).fit_predict(lena)

>> array([1, 1, 0, 2], dtype=int32)
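If you want to stay with agglomerative clustering, as the question asks, one option is to turn the similarities into distances and pass them as a precomputed matrix. The following is a minimal sketch, not a definitive recipe. It assumes the similarities lie in [0, 1], so that 1 - similarity can serve as a distance. The helper `cluster_users` and the variable names are hypothetical. The keyword for a precomputed matrix is `metric` in newer scikit-learn releases and `affinity` in older ones, so the sketch tries both.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

similarity = np.array([[1, 1, 0, 0],
                       [1, 1, 0, 0],
                       [0, 0, 1, 0.2],
                       [0, 0, 0.2, 1]])

# Assumption: similarities are in [0, 1], so 1 - similarity is a usable distance
distance = 1.0 - similarity


def cluster_users(distance, n_clusters):
    """Complete-linkage clustering on a precomputed square distance matrix."""
    options = dict(n_clusters=n_clusters, linkage="complete")
    try:
        # Newer scikit-learn releases name this parameter `metric`
        model = AgglomerativeClustering(metric="precomputed", **options)
    except TypeError:
        # Older releases use `affinity` for the same purpose
        model = AgglomerativeClustering(affinity="precomputed", **options)
    return model.fit_predict(distance)


labels = cluster_users(distance, n_clusters=3)
print(labels)  # one label per user; the label numbers themselves are arbitrary
```

Here the model is fitted on the full 4x4 matrix, so you get exactly one label per user rather than one per matrix entry. With 3 clusters the first two users end up together and the other two stay separate; with 2 clusters the last two users are merged as well.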

answered Oct 01 '22 13:10

farhawa

Tags: python, scikit-learn, hierarchical-clustering
