Is there an efficient way to cluster nodes in a graph using Jaccard similarity such that each cluster has at least K
nodes?
Jaccard similarity between nodes i
and j
:
Let S
be the set of neighbours of i
and T
be the set of neighbours of j
. Then the similarity between i
and j
is given by |(S ⋂ T)| / |(S ⋃ T)|
.
Have you tried implementing some algorithm yourself?
Compute all pairwise non-zero similarities (i.e. when they have at least one neighbor in common; this makes the candidate set much smaller than a squared matrix).
Sort them by similarity, and process pairs in decreasing similarity. Initially, each object is their own cluster.
When A and B are not yet in the same cluster, and either cluster has less than k members, join the two clusters. Repeat until all similarities have been processed.
Note that you may still end up having clusters with less than k members. For example, if your data set has less than k nodes total, or there are small subgraphs that are not connected etc.
You really should accept clusters of less than k nodes, i.e. unclustered nodes. Why would everything cluster? There always will be outliers and noise in real data.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With