Equivalent of Matlab's cluster quality function?

1 Answers

I present below a sample silhouette implementation in both MATLAB and Python/Numpy (keep in mind that I am more fluent in MATLAB):

1) MATLAB

function s = mySilhouette(X, IDX)
    %# X  : matrix of size N-by-p, data where rows are instances
    %# IDX: vector of size N, cluster index of each instance (starting from 1)
    %# s  : vector of size N, silhouette score value of each instance

    N = size(X,1);            %# number of instances
    K = numel(unique(IDX));   %# number of clusters

    %# compute pairwise distance matrix
    D = squareform( pdist(X,'euclidean').^2 );

    %# indices belonging to each cluster
    kIndices = accumarray(IDX, 1:N, [K 1], @(x){sort(x)});

    %# compute a,b,s for each instance
    %# a(i): average distance from i to all other data within the same cluster.
    %# b(i): lowest average dist from i to the data of another single cluster
    a = zeros(N,1);
    b = zeros(N,1);
    for i=1:N
        ind = kIndices{IDX(i)}; ind = ind(ind~=i);
        a(i) = mean( D(i,ind) );
        b(i) = min( cellfun(@(ind) mean(D(i,ind)), kIndices([1:K]~=IDX(i))) );
    end
    s = (b-a) ./ max(a,b);
end

To emulate the plot from the silhouette function in MATLAB, we group the silhouette values by cluster, sort within each, then plot the bars horizontally. MATLAB adds NaNs to separate the bars from the different clusters, I found it easier to simply color-code the bars:

%# sample data
load fisheriris
X = meas;
N = size(X,1);

%# cluster and compute silhouette score
K = 3;
[IDX,C] = kmeans(X, K, 'distance','sqEuclidean');
s = mySilhouette(X, IDX);

%# plot
[~,ord] = sortrows([IDX s],[1 -2]);
indices = accumarray(IDX(ord), 1:N, [K 1], @(x){sort(x)});
ytick = cellfun(@(ind) (min(ind)+max(ind))/2, indices);
ytickLabels = num2str((1:K)','%d');           %#'

h = barh(1:N, s(ord),'hist');
set(h, 'EdgeColor','none', 'CData',IDX(ord))
set(gca, 'CLim',[1 K], 'CLimMode','manual')
set(gca, 'YDir','reverse', 'YTick',ytick, 'YTickLabel',ytickLabels)
xlabel('Silhouette Value'), ylabel('Cluster')

%# compare against SILHOUETTE
figure, silhouette(X,IDX)

mySilhouette

2) Python

And here is what I came up with in Python:

import numpy as np
from scipy.cluster.vq import kmeans2
from scipy.spatial.distance import pdist, squareform
from sklearn import datasets
import matplotlib.pyplot as plt
from matplotlib import cm

def silhouette(X, cIDX):
    """
    Computes the silhouette score for each instance of a clustered dataset,
    which is defined as:
        s(i) = (b(i)-a(i)) / max{a(i),b(i)}
    with:
        -1 <= s(i) <= 1

    Args:
        X    : A M-by-N array of M observations in N dimensions
        cIDX : array of len M containing cluster indices (starting from zero)

    Returns:
        s    : silhouette value of each observation
    """

    N = X.shape[0]              # number of instances
    K = len(np.unique(cIDX))    # number of clusters

    # compute pairwise distance matrix
    D = squareform(pdist(X))

    # indices belonging to each cluster
    kIndices = [np.flatnonzero(cIDX==k) for k in range(K)]

    # compute a,b,s for each instance
    a = np.zeros(N)
    b = np.zeros(N)
    for i in range(N):
        # instances in same cluster other than instance itself
        a[i] = np.mean( [D[i][ind] for ind in kIndices[cIDX[i]] if ind!=i] )
        # instances in other clusters, one cluster at a time
        b[i] = np.min( [np.mean(D[i][ind]) 
                        for k,ind in enumerate(kIndices) if cIDX[i]!=k] )
    s = (b-a)/np.maximum(a,b)

    return s

def main():
    # load Iris dataset
    data = datasets.load_iris()
    X = data['data']

    # cluster and compute silhouette score
    K = 3
    C, cIDX = kmeans2(X, K)
    s = silhouette(X, cIDX)

    # plot
    order = np.lexsort((-s,cIDX))
    indices = [np.flatnonzero(cIDX[order]==k) for k in range(K)]
    ytick = [(np.max(ind)+np.min(ind))/2 for ind in indices]
    ytickLabels = ["%d" % x for x in range(K)]
    cmap = cm.jet( np.linspace(0,1,K) ).tolist()
    clr = [cmap[i] for i in cIDX[order]]

    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.barh(range(X.shape[0]), s[order], height=1.0, 
            edgecolor='none', color=clr)
    ax.set_ylim(ax.get_ylim()[::-1])
    plt.yticks(ytick, ytickLabels)
    plt.xlabel('Silhouette Value')
    plt.ylabel('Cluster')
    plt.show()

if __name__ == '__main__':
    main()

python_mySilhouette

Update:

As noted by others, scikit-learn has since then added its own silhouette metric implementation. To use it in the above code, replace the call to the custom-defined silhouette function with:

from sklearn.metrics import silhouette_samples

...

#s = silhouette(X, cIDX)
s = silhouette_samples(X, cIDX)    # <-- scikit-learn function

...

the rest of the code can still be used as-is to generate the exact same plot.

answered Oct 04 '22 14:10

Amro

Related questions
                            
                                How to create modules in Jupyter notebook and import them? Python
                            
                                Recursion: account value with distributions
                            
                                How to start debugger with an entry_point script
                            
                                Specific package version with conda-forge
                            
                                Large Performance difference when summing ints vs floats in Cython vs NumPy
                            
                                pytest ScopeMismatch error: how to use fixtures properly
                            
                                How can I get pip to list the repositories it's using?
                            
                                Python readline with custom delimiter
                            
                                Why is my venv using a different pip version than I have installed
                            
                                How to check to make sure all items in a list are of a certain type [duplicate]
                            
                                Can I create a Python extension module in D (instead of C)
                            
                                Python/PySerial and CPU usage
                            
                                SQLAlchemy getting column data types of query results
                            
                                How does git-diff generate hunk descriptions?
                            
                                Multiple XML Namespaces in tag with LXML
                            
                                Disabling Django CSRF for views that do not always have a response
                            
                                Parsing HTML with Lxml
                            
                                How to keep a developer key secret in a Python script that is hosted on GitHub
                            
                                In Django, how can I get an exception's message?
                            
                                PyCharm: How to skip over closing braces / brackets / parentheses?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Equivalent of Matlab's cluster quality function?

Tags:

python

numpy

scipy

matlab

cluster-analysis

Legend

People also ask