
Clustering structural 3D data

Let's say I have a number of objects (similar to proteins, but not exactly), each of which is represented by a vector of n 3D coordinates. Each of these objects is oriented somewhere in space. Their similarity can be calculated by superposing them with the Kabsch algorithm and computing the root-mean-square deviation (RMSD) of the aligned coordinates.
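For concreteness, here's a minimal sketch of that similarity measure, assuming each structure is an (n, 3) NumPy array (the function name is just illustrative):

import numpy as np

def kabsch_rmsd(P, Q):
    '''
    RMSD between two (n, 3) coordinate arrays after optimal
    rigid superposition via the Kabsch algorithm.
    '''
    # center both structures on their centroids
    P = P - P.mean(axis=0)
    Q = Q - Q.mean(axis=0)
    # SVD of the covariance matrix
    U, S, Vt = np.linalg.svd(P.T.dot(Q))
    # guard against an improper rotation (reflection)
    d = np.sign(np.linalg.det(Vt.T.dot(U.T)))
    R = Vt.T.dot(np.diag([1.0, 1.0, d])).dot(U.T)
    # rotate P onto Q and measure the remaining deviation
    diff = P.dot(R.T) - Q
    return np.sqrt((diff ** 2).sum() / len(P))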

My question is: what would be the recommended way of clustering a large set of these structures so as to extract the most populated cluster (i.e. the one to which the most structures belong)? And is there a way of doing this in Python? By way of example, here's a trivialized set of unclustered structures (each represented by the coordinates of its four vertices):

[Figure: a handful of unclustered four-vertex structures]

And then the desired clustering (using two clusters):

[Figure: the same structures grouped into two clusters]

I've tried aligning all of the structures to a reference structure (i.e. the first one) and then performing k-means on the distances between the reference and the aligned coordinates using Pycluster.kcluster, but this seems a little clumsy and doesn't work very well: the structures in each cluster don't end up being very similar to each other. Ideally the clustering wouldn't be done on the difference vectors but on the actual structures themselves; however, the structures have dimensions (n, 3) rather than the (n,) required for k-means clustering.
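The (n, 3) vs. (n,) mismatch itself is easy to work around by flattening each aligned structure into a single 3n-vector; a minimal sketch, where the random array is just a stand-in for real structures already superposed on the reference:

import numpy as np
import Pycluster as pc

# stand-in for an (m, n, 3) array of structures that have already
# been superposed on the reference with the Kabsch algorithm
aligned = np.random.rand(10, 4, 3)

flat = aligned.reshape(len(aligned), -1)  # (m, 3n) feature vectors
labels, error, nfound = pc.kcluster(flat, nclusters=2)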

The other option I tried is scipy.cluster.hierarchy. This seems to work pretty well, but I'm having trouble deciding which cluster is the most populated, since one can always find a more populated cluster by moving up to the next branch of the tree.
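One way to make "most populated" well-defined with the hierarchical approach is to cut the dendrogram at a fixed distance threshold instead of walking up the branches; a sketch using scipy.cluster.hierarchy, where the random RMSD matrix and the threshold are placeholders:

import numpy as np
import scipy.cluster.hierarchy as hier
from scipy.spatial.distance import squareform

# stand-in for a real m x m matrix of pairwise Kabsch RMSDs
m = 10
rmsd = np.random.rand(m, m)
rmsd = (rmsd + rmsd.T) / 2.0
np.fill_diagonal(rmsd, 0.0)

# average-linkage clustering on the condensed distance matrix
Z = hier.linkage(squareform(rmsd), method='average')

# cut the tree at a fixed RMSD threshold so cluster membership,
# and hence the biggest cluster, is well-defined
labels = hier.fcluster(Z, t=0.5, criterion='distance')
biggest = np.bincount(labels).argmax()
print(biggest, np.where(labels == biggest)[0])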

Any thoughts or suggestions or ideas about different (already implemented in python) clustering algorithms would be greatly appreciated.

asked Mar 14 '14 by juniper-


1 Answer

To give an introductory answer to my own question, I'll suggest that one can use the list of pairwise distances between the points of a shape as the representation to cluster on.

Let's create some shapes:

import numpy as np

shapes = np.array([[[1, 4], [4, 2], [11, 2], [14, 0]],
                   [[4, 5], [4, 2], [11, 2], [13, 0]],
                   [[1, 3], [4, 2], [11, 2], [14, 1.5]],
                   [[3, 5], [4, 2], [10, 7], [7, 9]],
                   [[5, 5], [4, 2], [10, 7], [6, 6]]])

def random_rotation():
    '''
    Return a random 2D rotation matrix.
    '''
    theta = 3 * np.pi * np.random.random()
    rotMatrix = np.array([[np.cos(theta), -np.sin(theta)],
                          [np.sin(theta),  np.cos(theta)]])
    return rotMatrix

new_shapes = []
for s in shapes:
    rr = random_rotation()
    # rotate each vertex and pad it with a zero z-coordinate
    new_shapes += [[list(rr.dot(p)) + [0] for p in s]]
new_shapes = np.array(new_shapes)

from matplotlib.pyplot import plot, text

for i, s in enumerate(new_shapes):
    plot(s[:,0], s[:,1], color='black')
    text(np.mean(s[:,0]), np.mean(s[:,1]), str(i), fontsize=14)

[Figure: the five randomly rotated shapes, labeled 0 through 4]

Then we define some auxiliary functions and build an array containing all the inter-vertex distances for each shape (darray).

import itertools as it
import math

def vec_distance(v1, v2):
    '''
    The distance between two vectors.
    '''
    diff = v2 - v1
    return math.sqrt(sum(diff * diff))

def distances(s):
    '''
    Compute the distance array for a shape s.
    '''
    ds = [vec_distance(p1, p2) for p1,p2 in it.combinations(s, r=2)]

    return np.array(ds)


# create an array of inter-vertex distances for each shape
darray = np.array([distances(s) for s in new_shapes])

Cluster them into two clusters using Pycluster.

import Pycluster as pc

clust = pc.kcluster(darray, 2)
print(clust)

And see that we end up with three entries in the first cluster and two in the other:

(array([0, 0, 0, 1, 1], dtype=int32), 4.576996142441375, 1)
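The first element is the per-shape cluster assignment (the other two are the within-cluster error and the number of times the optimal solution was found), so the most populated cluster can be read straight off the label array:

labels = clust[0]
biggest = np.bincount(labels).argmax()          # label of the largest cluster
print(biggest, np.where(labels == biggest)[0])  # its member shapes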

But which shapes do they correspond to?

import brewer2mpl

dark2 = brewer2mpl.get_map('Dark2', 'qualitative', 4).mpl_colors

# color each shape by its cluster assignment
for i, (s, c) in enumerate(zip(new_shapes, clust[0])):
    plot(s[:,0], s[:,1], color=dark2[c])
    text(np.mean(s[:,0]), np.mean(s[:,1]), str(i), fontsize=14)

[Figure: the shapes colored by cluster assignment]

Looks good! The problem is that as shapes get larger, the size of the distance array grows quadratically with the number of vertices. I found a presentation that describes this problem and suggests some solutions (such as SVD, presumably as a form of dimensionality reduction) to speed things up.
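I haven't tried it here, but as a rough sketch of that kind of reduction applied to this example's darray (the number of retained components k is a placeholder):

# center the (num_shapes, num_pairs) distance array and project it
# onto its first k right-singular vectors before clustering
centered = darray - darray.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
k = 2
darray_reduced = centered.dot(Vt[:k].T)  # (num_shapes, k)
clust_reduced = pc.kcluster(darray_reduced, 2)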

I'm not going to accept my answer just yet because I'm interested in any other ideas or thoughts about how to approach this simple problem.

answered Sep 28 '22 by juniper-