I used Clustal Omega to get a distance matrix of 500 protein sequences (they are homologous to each other).
I want to use affinity propagation to cluster these sequences.
Initially, because I observed by hand that the distance matrix only had values between 0 and 1, with 0 distance = 100% identity, I reasoned that I could just take (1 - distance) to get affinity.
I ran my code, and the clusters looked reasonable, and I thought all was well... until I read that typically, affinity matrices are calculated from distance matrices by applying a "heat kernel". That's when all hell broke loose in my mind.
Did I get the concept of an affinity matrix wrong? Is there an easy way of computing the affinity matrix? scikit-learn offers the following formula:
similarity = np.exp(-beta * distance / distance.std())
But what is beta? I know distance.std() is the standard deviation of the distance.
I'm quite confused and lost right now with the concepts involved (as opposed to the actual coding implementation), so any help is greatly appreciated!
P.S. I've tried posting to Biostars.org, but I haven't gotten an answer there...
I think both 1 - distance and exp(-beta * distance) are valid ways to convert a distance into a similarity (though they differ in their interpretation in a probabilistic framework). I would simply use whichever gives the better results.
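As for beta: in that formula it is just a positive constant you choose yourself, and it controls how quickly similarity falls off as distance grows (larger beta means a sharper drop). Here is a minimal sketch for comparing the two conversions, not a definitive recipe. The random `distance` matrix is a made-up stand-in for the Clustal Omega output, the helper names `linear_similarity` and `heat_kernel_similarity` are mine, and beta=1.0 is an arbitrary starting value.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation


def linear_similarity(distance):
    """Similarity as 1 - distance; assumes all distances lie in [0, 1]."""
    return 1.0 - distance


def heat_kernel_similarity(distance, beta=1.0):
    """Similarity as exp(-beta * distance / std); beta sets the decay rate."""
    return np.exp(-beta * distance / distance.std())


# Hypothetical stand-in for the real distance matrix: random points in the
# unit square, pairwise Euclidean distances rescaled to [0, 1].
rng = np.random.default_rng(0)
points = rng.random((30, 2))
diff = points[:, None, :] - points[None, :, :]
distance = np.sqrt((diff ** 2).sum(axis=-1))
distance /= distance.max()

for name, similarity in [
    ("1 - distance", linear_similarity(distance)),
    ("heat kernel", heat_kernel_similarity(distance, beta=1.0)),
]:
    # affinity="precomputed" tells scikit-learn the input is already a
    # similarity matrix (higher value = more alike), not raw features.
    model = AffinityPropagation(affinity="precomputed", random_state=0)
    labels = model.fit_predict(similarity)
    print(name, "->", len(set(labels)), "clusters")
```

With the real matrix you would drop the random-data part, try a few values of beta, and keep whichever conversion gives clusters that make biological sense. Note that the number of clusters also depends on the `preference` parameter of AffinityPropagation, which defaults to the median of the input similarities, so the two conversions are not expected to give the same count.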