Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Scikit-learn Agglomerative Clustering Connectivity Matrix

I am attempting to perform constrained clustering using sklearn's agglomerative clustering command. To make the algorithm constrained, it requests a "connectivity matrix". This is described as:

The connectivity constraints are imposed via an connectivity matrix: a scipy sparse matrix that has elements only at the intersection of a row and a column with indices of the dataset that should be connected. This matrix can be constructed from a-priori information: for instance, you may wish to cluster web pages by only merging pages with a link pointing from one to another.

I have a list of observation pairs that I wish the algorithm will force to remain in the same cluster. I can convert this into a sparse scipy matrix (either coo or csr), but the resulting clusters fail to force the constraints.

Some Data:

import numpy as np
import scipy as sp
import pandas as pd
import scipy.sparse as ss
from sklearn.cluster import AgglomerativeClustering


# unique ids 
ids = np.arange(10)

# Pairs that should belong to the same cluster
mustLink = pd.DataFrame([[1, 2], [1, 3], [4, 6]], columns=['A', 'B'])

# Features for training the model
data = pd.DataFrame([
[.0873,-1.619,-1.343],
[0.697456, 0.410943, 0.804333],
[-1.295829, -0.709441, -0.376771],
[-0.404985, -0.107366, 0.875791],
[-0.404985, -0.107366,  0.875791],
[-0.515996, 0.731980, -1.569586],
[1.024580,  0.409148, 0.149408],
[-0.074604, 1.269414, 0.115744],
[-0.006706, 2.097276, 0.681819],
[-0.432196, 1.249149,-1.159271]])

Convert the pairs to a "connectivity matrix":

# Blank coo matrix to csr
sm = ss.coo_matrix((len(ids), len(ids)), np.int32).tocsr()
# Insert 1 for connected pairs and diagonals
for i in np.arange(len(mustLink)): # add links to both sides of the matrix
    sm[mustLink.loc[i, 'A'], mustLink.loc[i, 'B']] = 1
    sm[mustLink.loc[i, 'B'], mustLink.loc[i, 'A']] = 1
for i in np.arange(sm.tocsr()[1].shape[1]): # add diagonals
    sm[i,i] = 1
sm = sm.tocoo() # convert back to coo format

Train and fit the agglomerative clustering model:

m = AgglomerativeClustering(n_clusters=6, connectivity=sm)
out = m.fit_predict(X=data)

Warning I receive:

UserWarning: the number of connected components of the connectivity matrix is 7 > 1. Completing it to avoid stopping the tree early. connectivity, n_components = _fix_connectivity(X, connectivity)

In addition to the ominous warning, the pairs that I hoped would belong to the same cluster, do not.

Is this because the sklearn algorithms are not designed to handle mustlink constraints and instead can only use a distance matrix (distinction drawn here)?

like image 903
Michael Davidson Avatar asked Mar 15 '17 22:03

Michael Davidson


1 Answers

When passing a connectivity matrix to sklearn.cluster.AgglomerativeClustering, it is imperative that all points in the matrix be connected. Agglomerative clustering creates a hierarchy, in which all points are iteratively grouped together, so isolated clusters cannot exist. The connectivity matrix is useful to "turn off" connections for points that may be nearby in Euclidean space, but far away from another metric (see the jelly roll example show in the User Guide here).

Another way to think of this is that your points must form a graph that is not disjoint, all you can do is turn off edges between nodes.

This warning:

UserWarning: the number of connected components of the connectivity matrix is 7 > 1. Completing it to avoid stopping the tree early. connectivity, n_components = _fix_connectivity(X, connectivity)

is telling you that you have 7 disjoint clusters, which is more than the 1 which is allowed. As such sklearn "completes" it (essentially fills it in to not have disjoint clusters), which is why your constraints are not respected at all.

There is no easy fix here. You can try reassigning centers after clustering to respect your constraints, or you'll need to employ a different algorithm.

like image 150
yhenon Avatar answered Sep 21 '22 14:09

yhenon