I'm having trouble understanding a specific use case of the sklearn.cluster.SpectralClustering
class as outlined in the official documentation here. Say I want to use my own affinity matrix to perform clustering. I first instantiate an object of class SpectralClustering
as follows:
from sklearn.clustering import SpectralClustering
cl = SpectralClustering(n_clusters=5,affinity='precomputed')
The documentation for the affinity
parameter above is as follows:
affinity : string, array-like or callable, default ‘rbf’
If a string, this may be one of ‘nearest_neighbors’, ‘precomputed’, ‘rbf’ or one of the kernels supported by sklearn.metrics.pairwise_kernels. Only kernels that produce similarity scores (non-negative values that increase with similarity) should be used. This property is not checked by the clustering algorithm.
Now the object cl
has a method fit
for which the documentation about its sole parameter X
is as follows:
X : array-like or sparse matrix, shape (n_samples, n_features)
OR, if affinity==
precomputed
, a precomputed affinity matrix of shape (n_samples, n_samples)
This is where it gets confusing. I am using my own affinity matrix, where a measure of 0 means two points are identical, with a higher number meaning two points are more dissimilar. However, the other choices for the parameter affinity
actually take a data set and produce a similarity matrix, for which higher values are indicative of more similarity, and lower values indicate dissimilarity (such as the radial basis kernel).
So when using the fit
method on my instance of SpectralClustering
, do I actually need to transform my affinity matrix into a similarity matrix before passing it to the fit
method call as the parameter X
? The same documentation page makes a note on transforming distance to well-behaved similarities, but does not explicitly indicate where this step should should be carried out, and via which method call.
Basic Terminologies Used in Spectral ClusteringAn Affinity Matrix is like an Adjacency Matrix, except the value for a pair of points expresses how similar those points are to each other. If pairs of points are very dissimilar then the affinity should be 0. If the points are identical, then the affinity might be 1.
Unlike clustering algorithms such as k-means or k-medoids, affinity propagation does not require the number of clusters to be determined or estimated before running the algorithm, for this purpose the two important parameters are the preference, which controls how many exemplars (or prototypes) are used, and the ...
It stands for “Density-based spatial clustering of applications with noise”. This algorithm is based on the intuitive notion of “clusters” & “noise” that clusters are dense regions of the lower density in the data space, separated by lower density regions of data points. Scikit-learn have sklearn. cluster.
If you have a similarity matrix, try to use Spectral methods for clustering. Take a look at Laplacian Eigenmaps for example. The idea is to compute eigenvectors from the Laplacian matrix (computed from the similarity matrix) and then come up with the feature vectors (one for each element) that respect the similarities.
Straight from the docs:
If you have an affinity matrix, such as a distance matrix, for which 0 means identical elements, and high values means very dissimilar elements, it can be transformed in a similarity matrix that is well suited for the algorithm by applying the Gaussian (RBF, heat) kernel:
np.exp(- X ** 2 / (2. * delta ** 2))
This goes in your own code, and the result of this can be passed to fit
. For the purpose of this algorithm, affinity means similarity, not distance.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With