
Sklearn Agglomerative Clustering Custom Affinity

I'm trying to use agglomerative clustering with a custom distance metric (i.e. affinity), since I'd like to cluster sequences of integers by sequence similarity rather than by something like the Euclidean distance, which isn't meaningful here.

My data looks something like this:

>>> dat.values
array([[860, 261, 240, ..., 300, 241,   1],
       [860, 840, 860, ..., 860, 240,   1],
       [260, 860, 260, ..., 260, 220,   1],
       ...,
       [260, 260, 260, ..., 260, 260,   1],
       [260, 860, 260, ..., 840, 860,   1],
       [280, 240, 241, ..., 240, 260,   1]])

I've created the following similarity function:

def sim(x, y): 
    return np.sum(np.equal(np.array(x), np.array(y)))/len(x)

So I just return the fraction of matching values in the two sequences, computed with NumPy, and make the following call:

cluster = AgglomerativeClustering(n_clusters=5, affinity=sim, linkage='average')
cluster.fit(dat.values)

But I'm getting this error:

TypeError: sim() missing 1 required positional argument: 'y'

I'm not sure why I'm getting this error; I thought the function would be called on pairs of rows, so both required arguments would be passed.

Any help with this would be greatly appreciated.

ApprenticeOfMathematics asked Dec 19 '18 10:12


2 Answers

When 'affinity' is a callable, it is called with a single argument X (your feature or observation matrix) and must return the distances between all pairs of points (samples) in it. It is not called once per pair of rows, which is why 'y' is reported as missing.

So you need to wrap your method like this:

import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import pairwise_distances

# Your method, applied to a single pair of samples
def sim(x, y):
    return np.sum(np.equal(np.array(x), np.array(y)))/len(x)

# Method to apply sim() to all sample pairs; returns an (n_samples, n_samples) matrix
def sim_affinity(X):
    return pairwise_distances(X, metric=sim)

cluster = AgglomerativeClustering(n_clusters=5, affinity=sim_affinity, linkage='average')
cluster.fit(X)  # X is your observation matrix, e.g. dat.values

Or you can use affinity='precomputed' as @avchauzov has suggested. For that you will have to pass the pre-calculated distance matrix for your observations in fit(). Something like:

cluster = AgglomerativeClustering(n_clusters=5, affinity='precomputed', linkage='average')
distance_matrix = sim_affinity(X)
cluster.fit(distance_matrix)

Note: Your function returns a similarity, not a distance. The clustering treats the values as distances and merges the smallest ones first, so with a similarity the most alike rows would be treated as the farthest apart. Tweak your function to return a distance instead. Something like:

def sim(x, y): 
    # Subtracted from 1.0 (highest similarity), so now it represents distance
    return 1.0 - np.sum(np.equal(np.array(x), np.array(y)))/len(x)
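
As a rough check of the distance version, the sketch below builds the full pairwise matrix with NumPy broadcasting instead of calling sim() once per pair. The helper name mismatch_distance_matrix and the toy array are made up for illustration; the toy array only stands in for dat.values.

```python
import numpy as np

def mismatch_distance_matrix(X):
    """Fraction of positions at which each pair of rows differs (n x n)."""
    X = np.asarray(X)
    # Broadcasting compares every row with every other row: shape (n, n, m),
    # then the mean over the last axis gives the fraction of mismatches.
    return (X[:, None, :] != X[None, :, :]).mean(axis=2)

# Toy stand-in for dat.values (made-up numbers, not the asker's data)
X = np.array([
    [860, 260, 240, 300],
    [860, 260, 240, 241],
    [260, 860, 260, 260],
    [260, 860, 260, 220],
])

D = mismatch_distance_matrix(X)
print(D)
```

The result is symmetric with zeros on the diagonal, which is what a distance matrix should look like. This quantity is the normalized Hamming distance, so pairwise_distances(X, metric='hamming') should give the same values; the broadcast version keeps an n x n x m boolean array in memory, so it is probably only suitable for modest n.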
Vivek Kumar answered Oct 08 '22 18:10


The common way to do it is to set affinity='precomputed' and fit on the distance matrix (see example here: https://gist.github.com/codehacken/8b9316e025beeabb082dda4d0654a6fa)

UPD In sklearn/cluster/hierarchical.py (https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/cluster/hierarchical.py#L460) you can see that your custom affinity has to take only X (not y) as its input: it is called inside linkage_tree with the whole data matrix. So, you need to rewrite your sim() function.

But in my opinion the first way is much more convenient.
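
A minimal sketch of the precomputed route, assuming a toy array in place of dat.values and a mismatch-fraction distance. In newer scikit-learn releases the affinity parameter was renamed to metric, so the sketch picks whichever keyword the installed version accepts; that lookup is an illustration, not something from the linked gist.

```python
import inspect

import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Toy stand-in for dat.values (made-up numbers)
X = np.array([
    [860, 260, 240, 300],
    [860, 260, 240, 241],
    [260, 860, 260, 260],
    [260, 860, 260, 220],
])

# Pairwise distance: fraction of positions where two rows differ
D = (X[:, None, :] != X[None, :, :]).mean(axis=2)

# Older versions take affinity=..., newer ones take metric=...
params = inspect.signature(AgglomerativeClustering).parameters
keyword = "metric" if "metric" in params else "affinity"

cluster = AgglomerativeClustering(
    n_clusters=2, linkage="average", **{keyword: "precomputed"}
)
labels = cluster.fit_predict(D)  # fit on the distance matrix, not on X
print(labels)
```

The first two rows should end up in one cluster and the last two in the other. Note that linkage='ward' cannot be combined with a precomputed matrix, which is one reason to keep linkage='average' here.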

avchauzov answered Oct 08 '22 16:10