I'm trying to use agglomerative clustering with a custom distance metric (i.e. affinity), since I'd like to cluster sequences of integers by sequence similarity rather than by something like the Euclidean distance, which isn't meaningful here.
My data looks something like this:
>>> dat.values
array([[860, 261, 240, ..., 300, 241, 1],
[860, 840, 860, ..., 860, 240, 1],
[260, 860, 260, ..., 260, 220, 1],
...,
[260, 260, 260, ..., 260, 260, 1],
[260, 860, 260, ..., 840, 860, 1],
[280, 240, 241, ..., 240, 260, 1]])
I've created the following similarity function:
def sim(x, y):
    return np.sum(np.equal(np.array(x), np.array(y)))/len(x)
So I just return the fraction of matching values in the two sequences with numpy, and make the following call:
cluster = AgglomerativeClustering(n_clusters=5, affinity=sim, linkage='average')
cluster.fit(dat.values)
But I'm getting an error saying
TypeError: sim() missing 1 required positional argument: 'y'
I'm not sure why I'm getting this error; I thought the function would be called on pairs of rows, so each required argument would be passed.
Any help with this would be greatly appreciated.
When 'affinity' is a callable, it receives a single input X (your feature or observation matrix) and must itself compute the distances between all pairs of points (samples) in it, returning a square distance matrix.
So you need to modify your method as:
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import pairwise_distances

# Your method to calculate distance between two samples
def sim(x, y):
    return np.sum(np.equal(np.array(x), np.array(y)))/len(x)

# Method to calculate distances between all sample pairs
def sim_affinity(X):
    return pairwise_distances(X, metric=sim)

# X is your data matrix, e.g. dat.values
cluster = AgglomerativeClustering(n_clusters=5, affinity=sim_affinity, linkage='average')
cluster.fit(X)
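To make the shape requirement concrete, here is a minimal sketch of what a callable affinity is expected to hand back: one square (n_samples, n_samples) matrix. It uses plain NumPy broadcasting rather than pairwise_distances. The name mismatch_affinity and the toy array are made up for illustration, and the function returns the fraction of differing positions (a distance), not the similarity from the question.

```python
import numpy as np

def mismatch_affinity(X):
    """Hypothetical helper: fraction of differing positions for every pair of rows."""
    X = np.asarray(X)
    # Compare every row with every other row: shape (n, n, n_features)
    differs = X[:, None, :] != X[None, :, :]
    # Average over the feature axis to get an (n, n) distance matrix
    return differs.mean(axis=2)

# Toy data, invented for this example
toy = np.array([[860, 261, 240, 1],
                [860, 840, 860, 1],
                [260, 860, 260, 1]])

D = mismatch_affinity(toy)
print(D.shape)  # square: one row and one column per sample
print(D)
```

The broadcasted comparison needs memory proportional to n_samples squared times n_features, so for larger data pairwise_distances is probably the safer choice.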
Or you can use affinity='precomputed', as @avchauzov has suggested. For that you will have to pass the precomputed distance matrix for your observations to fit(). Something like:
cluster = AgglomerativeClustering(n_clusters=5, affinity='precomputed', linkage='average')
distance_matrix = sim_affinity(X)
cluster.fit(distance_matrix)
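One caveat if you run this on a recent scikit-learn: the affinity parameter was renamed to metric (deprecated in 1.2, removed in 1.4). The sketch below is one way to write the precomputed variant so it works with either name. The random demo data, the mismatch function and the signature check are assumptions made for illustration, not part of the original code.

```python
import inspect

import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import pairwise_distances

def mismatch(x, y):
    # Hypothetical distance: fraction of positions where the two rows differ
    return np.mean(x != y)

# Made-up demo data: 12 sequences of length 6
rng = np.random.default_rng(0)
X_demo = rng.choice([240, 260, 860], size=(12, 6))

# Square matrix of pairwise distances, computed once up front
distance_matrix = pairwise_distances(X_demo, metric=mismatch)

# Use whichever keyword this scikit-learn version understands
params = inspect.signature(AgglomerativeClustering.__init__).parameters
key = "metric" if "metric" in params else "affinity"

cluster = AgglomerativeClustering(
    n_clusters=3, linkage="average", **{key: "precomputed"}
)
labels = cluster.fit_predict(distance_matrix)
print(labels)
```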
Note: You have specified a similarity in place of a distance. Agglomerative clustering merges the pairs with the smallest values first, so with a similarity the least alike rows would be joined first. Make sure you understand how the clustering will work here, or tweak your similarity function to return a distance. Something like:
def sim(x, y):
    # Subtracted from 1.0 (highest similarity), so now it represents distance
    return 1.0 - np.sum(np.equal(np.array(x), np.array(y)))/len(x)
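As a side note, one minus the fraction of matching positions is the normalized Hamming distance, which SciPy already provides. The following sketch checks that on a small made-up array; it is only meant to show the equivalence, and whether the built-in metric is preferable depends on your data size.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def sim(x, y):
    # Similarity from the question: fraction of matching positions
    return np.sum(np.equal(np.array(x), np.array(y))) / len(x)

# Toy data, invented for this example
toy = np.array([[860, 261, 240, 1],
                [860, 840, 860, 1],
                [260, 860, 260, 1]])

# Built-in Hamming distance: proportion of positions that differ
hamming = squareform(pdist(toy, metric="hamming"))

# The same matrix built by hand from 1 - similarity
manual = np.array([[1.0 - sim(a, b) for b in toy] for a in toy])

print(np.allclose(hamming, manual))
```

The resulting matrix can be passed to fit() with the precomputed option in the same way as any other distance matrix.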
The common way to do it is to set affinity='precomputed' and fit on the distance matrix (see example here: https://gist.github.com/codehacken/8b9316e025beeabb082dda4d0654a6fa).
UPD: In sklearn/cluster/hierarchical.py (https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/cluster/hierarchical.py#L460) you can see that a custom affinity is called with only X (not y) as its input, and that the call is made from linkage_tree. So you need to rewrite your sim() function to take the whole data matrix.
But in my opinion the first way is much more convenient.