Let's say I have a number of objects (similar to proteins, but not exactly), each of which is represented by a vector of n 3D coordinates. Each of these objects is oriented somewhere in space. Their similarity can be calculated by aligning them using the Kabsch Algorithm and calculating the root-mean-square deviation of the aligned coordinates.
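For reference, a minimal sketch of that similarity measure in plain NumPy (the kabsch_rmsd name and exact conventions are mine, not taken from any particular library):

import numpy as np

def kabsch_rmsd(P, Q):
    '''
    Superimpose P onto Q (both (n, 3) arrays of corresponding points)
    using the Kabsch algorithm and return the RMSD of the alignment.
    '''
    # center both coordinate sets on their centroids
    P = P - P.mean(axis=0)
    Q = Q - Q.mean(axis=0)
    # optimal rotation from the SVD of the covariance matrix
    U, S, Vt = np.linalg.svd(np.dot(P.T, Q))
    # correct for a possible reflection so the result is a proper rotation
    d = np.sign(np.linalg.det(np.dot(Vt.T, U.T)))
    D = np.diag([1.0, 1.0, d])
    R = np.dot(Vt.T, np.dot(D, U.T))
    # rotate P onto Q and compute the root-mean-square deviation
    P_rot = np.dot(P, R.T)
    return np.sqrt(((P_rot - Q) ** 2).sum() / len(P))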
My question is: what is the recommended way of clustering a large set of these structures so as to extract the most populated cluster (i.e. the one to which the most structures belong)? And is there a way of doing this in Python? By way of example, here's a trivialized set of unclustered structures (each represented by the coordinates of its four vertices):
And then the desired clustering (using two clusters):
I've tried aligning all of the structures to a reference structure (i.e. the first structure) and then performing k-means on the distances between the reference and the aligned coordinates using Pycluster.kcluster, but this seems a bit clumsy and doesn't work very well: the structures in each cluster don't end up being very similar to each other. Ideally the clustering wouldn't be done on the difference vectors but on the actual structures themselves; however, the structures have dimensions (n, 3) rather than the (n,) vectors required for k-means clustering.
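One way around that shape mismatch is simply to flatten each aligned (n, 3) coordinate array into a single (3n,) feature vector before clustering. A sketch, assuming a hypothetical structures list whose entries have already been aligned to the reference:

import numpy as np
import Pycluster as pc

# structures: list of (n, 3) arrays, already aligned to the reference structure
flat = np.array([s.ravel() for s in structures])       # shape (n_structures, 3n)
labels, error, nfound = pc.kcluster(flat, nclusters=2)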
The other option I tried is scipy.cluster.hierarchy. This seems to work pretty well, but I'm having trouble deciding which cluster is the most populated, since one can always find a more populated cluster by moving up to the next branch of the tree.
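One workaround would be to cut the tree at a fixed RMSD threshold and take the largest cluster that results. A sketch using scipy.cluster.hierarchy, where the rmsd_matrix variable and the threshold of 2.0 are placeholders rather than values from my actual data:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# rmsd_matrix: (m, m) symmetric array of pairwise RMSDs between the m structures
condensed = squareform(rmsd_matrix)                 # condensed distance vector
Z = linkage(condensed, method='average')            # hierarchical clustering
labels = fcluster(Z, t=2.0, criterion='distance')   # cut the tree at RMSD 2.0
biggest = np.bincount(labels).argmax()              # label of the largest cluster
members = np.where(labels == biggest)[0]            # indices of its structures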
Any thoughts, suggestions, or ideas about other clustering algorithms (already implemented in Python) would be greatly appreciated.
To give an introductory answer to my own question, I'll suggest that one can use the list of distances between each pair of points in a shape as the feature vector to perform the clustering on.
Let's create some shapes:
import numpy as np
from matplotlib.pyplot import plot, text   # or run in an interactive pylab session

shapes = np.array([[[1,4],[4,2],[11,2],[14,0]],
                   [[4,5],[4,2],[11,2],[13,0]],
                   [[1,3],[4,2],[11,2],[14,1.5]],
                   [[3,5],[4,2],[10,7],[7,9]],
                   [[5,5],[4,2],[10,7],[6,6]]])
def random_rotation():
    '''
    Return a random 2D rotation matrix.
    '''
    theta = 3 * np.pi * np.random.random()
    rotMatrix = np.array([[np.cos(theta), -np.sin(theta)],
                          [np.sin(theta),  np.cos(theta)]])
    return rotMatrix
new_shapes = []
for s in shapes:
    rr = random_rotation()
    # rotate each shape and pad its points with a zero z-coordinate
    new_shapes += [[list(rr.dot(p)) + [0] for p in s]]
new_shapes = np.array(new_shapes)

for i, s in enumerate(new_shapes):
    plot(s[:,0], s[:,1], color='black')
    text(np.mean(s[:,0]), np.mean(s[:,1]), str(i), fontsize=14)
Then we create some auxiliary functions and build an array containing all of the inter-vertex distances for each shape (darray).
import itertools as it
import math

def vec_distance(v1, v2):
    '''
    The euclidean distance between two vectors.
    '''
    diff = v2 - v1
    return math.sqrt(sum(diff * diff))

def distances(s):
    '''
    Compute the array of pairwise inter-vertex distances for a shape s.
    '''
    ds = [vec_distance(p1, p2) for p1, p2 in it.combinations(s, r=2)]
    return np.array(ds)
# create an array of inter-vertex distances for each shape
darray = np.array([distances(s) for s in new_shapes])
Cluster them into two clusters using Pycluster.
import Pycluster as pc

clust = pc.kcluster(darray, nclusters=2)
print(clust)
And see that we end up with three entries in the first cluster and two in the other.
(array([0, 0, 0, 1, 1], dtype=int32), 4.576996142441375, 1)
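Since the original question was about extracting the most populated cluster, note that the first element of clust is the per-shape cluster assignment, so the largest cluster can be pulled out directly:

labels = clust[0]                          # array([0, 0, 0, 1, 1])
biggest = np.bincount(labels).argmax()     # index of the most populated cluster
members = np.where(labels == biggest)[0]   # shapes belonging to it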
But which shapes do they correspond to?
import brewer2mpl

dark2 = brewer2mpl.get_map('Dark2', 'qualitative', 4).mpl_colors
for i, (s, c) in enumerate(zip(new_shapes, clust[0])):
    plot(s[:,0], s[:,1], color=dark2[c])
    text(np.mean(s[:,0]), np.mean(s[:,1]), str(i), fontsize=14)
Looks good! The problem is that as shapes get larger, the distance array grows quadratically with the number of vertices. I found a presentation that describes this problem and suggests some solutions (such as SVD, which I presume is used as a form of dimensionality reduction) to speed it up.
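As a rough sketch of that idea (mine, not taken from the presentation), one could keep only the first few singular components of darray before clustering; the choice of k = 3 here is just a placeholder:

import numpy as np
import Pycluster as pc

# project each shape's distance vector onto the first k right singular vectors
U, S, Vt = np.linalg.svd(darray, full_matrices=False)
k = 3                                     # number of components to keep
reduced = U[:, :k] * S[:k]                # shape (n_shapes, k)
clust = pc.kcluster(reduced, nclusters=2)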
I'm not going to accept my answer just yet because I'm interested in any other ideas or thoughts about how to approach this simple problem.