Let's say I have a number of objects (similar to proteins, but not exactly), each of which is represented by a vector of n 3D coordinates. Each of these objects is oriented somewhere in space. Their similarity can be calculated by aligning them using the Kabsch Algorithm and calculating the root-mean-square deviation of the aligned coordinates.
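For reference, a minimal sketch of that similarity measure in plain NumPy (the kabsch_rmsd name and exact conventions are mine, not taken from any particular library):

import numpy as np

def kabsch_rmsd(P, Q):
    '''
    Superimpose P onto Q (both (n, 3) arrays of corresponding points)
    using the Kabsch algorithm and return the RMSD of the alignment.
    '''
    # center both coordinate sets on their centroids
    P = P - P.mean(axis=0)
    Q = Q - Q.mean(axis=0)
    # optimal rotation from the SVD of the covariance matrix
    U, S, Vt = np.linalg.svd(np.dot(P.T, Q))
    # correct for a possible reflection so the result is a proper rotation
    d = np.sign(np.linalg.det(np.dot(Vt.T, U.T)))
    D = np.diag([1.0, 1.0, d])
    R = np.dot(Vt.T, np.dot(D, U.T))
    # rotate P onto Q and compute the root-mean-square deviation
    P_rot = np.dot(P, R.T)
    return np.sqrt(((P_rot - Q) ** 2).sum() / len(P))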
My question is: what is the recommended way of clustering a large set of these structures so as to extract the most populated cluster (i.e. the one to which the most structures belong)? And is there a way of doing this in Python? By way of example, here's a trivialized set of unclustered structures (each represented by the coordinates of its four vertices):
And then the desired clustering (using two clusters):
I've tried aligning all of the structures to a reference structure (i.e. the first structure) and then performing k-means on the distances between the reference and the aligned coordinates using Pycluster.kcluster, but this seems a bit clumsy and doesn't work very well: the structures in each cluster don't end up being very similar to each other. Ideally the clustering wouldn't be done on the difference vectors but on the actual structures themselves; however, the structures have dimensions (n, 3) rather than the (n,) vectors required for k-means clustering.
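One way around that shape mismatch is simply to flatten each aligned (n, 3) coordinate array into a single (3n,) feature vector before clustering. A sketch, assuming a hypothetical structures list whose entries have already been aligned to the reference:

import numpy as np
import Pycluster as pc

# structures: list of (n, 3) arrays, already aligned to the reference structure
flat = np.array([s.ravel() for s in structures])       # shape (n_structures, 3n)
labels, error, nfound = pc.kcluster(flat, nclusters=2)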
The other option I tried is scipy.cluster.hierarchy. This seems to work pretty well, but I'm having trouble deciding which cluster is the most populated, since one can always find a more populated cluster by moving up to the next branch of the tree.
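One workaround would be to cut the tree at a fixed RMSD threshold and take the largest cluster that results. A sketch using scipy.cluster.hierarchy, where the rmsd_matrix variable and the threshold of 2.0 are placeholders rather than values from my actual data:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# rmsd_matrix: (m, m) symmetric array of pairwise RMSDs between the m structures
condensed = squareform(rmsd_matrix)                 # condensed distance vector
Z = linkage(condensed, method='average')            # hierarchical clustering
labels = fcluster(Z, t=2.0, criterion='distance')   # cut the tree at RMSD 2.0
biggest = np.bincount(labels).argmax()              # label of the largest cluster
members = np.where(labels == biggest)[0]            # indices of its structures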
Any thoughts, suggestions, or ideas about other clustering algorithms (already implemented in Python) would be greatly appreciated.
To give an introductory answer to my own question, I'll suggest that one can use the list of distances between each pair of points in a shape as the feature vector to perform the clustering on.
Let's create some shapes:
import numpy as np
from matplotlib.pyplot import plot, text   # or run in an interactive pylab session

shapes = np.array([[[1,4],[4,2],[11,2],[14,0]],
                   [[4,5],[4,2],[11,2],[13,0]],
                   [[1,3],[4,2],[11,2],[14,1.5]],
                   [[3,5],[4,2],[10,7],[7,9]],
                   [[5,5],[4,2],[10,7],[6,6]]])
def random_rotation():
    '''
    Return a random 2D rotation matrix.
    '''
    theta = 3 * np.pi * np.random.random()
    rotMatrix = np.array([[np.cos(theta), -np.sin(theta)],
                          [np.sin(theta),  np.cos(theta)]])
    return rotMatrix
new_shapes = []
for s in shapes:
    rr = random_rotation()
    # rotate each shape and pad its points with a zero z-coordinate
    new_shapes += [[list(rr.dot(p)) + [0] for p in s]]
new_shapes = np.array(new_shapes)

for i, s in enumerate(new_shapes):
    plot(s[:,0], s[:,1], color='black')
    text(np.mean(s[:,0]), np.mean(s[:,1]), str(i), fontsize=14)
Then we create some auxiliary functions and build an array containing all of the inter-vertex distances for each shape (darray).
import itertools as it
import math

def vec_distance(v1, v2):
    '''
    The euclidean distance between two vectors.
    '''
    diff = v2 - v1
    return math.sqrt(sum(diff * diff))

def distances(s):
    '''
    Compute the array of pairwise inter-vertex distances for a shape s.
    '''
    ds = [vec_distance(p1, p2) for p1, p2 in it.combinations(s, r=2)]
    return np.array(ds)
# create an array of inter-vertex distances for each shape
darray = np.array([distances(s) for s in new_shapes])
Cluster them into two clusters using Pycluster.
import Pycluster as pc

clust = pc.kcluster(darray, nclusters=2)
print(clust)
And see that we end up with three entries in the first cluster and two in the other.
(array([0, 0, 0, 1, 1], dtype=int32), 4.576996142441375, 1)
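Since the original question was about extracting the most populated cluster, note that the first element of clust is the per-shape cluster assignment, so the largest cluster can be pulled out directly:

labels = clust[0]                          # array([0, 0, 0, 1, 1])
biggest = np.bincount(labels).argmax()     # index of the most populated cluster
members = np.where(labels == biggest)[0]   # shapes belonging to it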
But which shapes do they correspond to?
import brewer2mpl

dark2 = brewer2mpl.get_map('Dark2', 'qualitative', 4).mpl_colors
for i, (s, c) in enumerate(zip(new_shapes, clust[0])):
    plot(s[:,0], s[:,1], color=dark2[c])
    text(np.mean(s[:,0]), np.mean(s[:,1]), str(i), fontsize=14)
Looks good! The problem is that as shapes get larger, the distance array grows quadratically with the number of vertices. I found a presentation that describes this problem and suggests some solutions (such as SVD, which I presume is used as a form of dimensionality reduction) to speed it up.
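As a rough sketch of that idea (mine, not taken from the presentation), one could keep only the first few singular components of darray before clustering; the choice of k = 3 here is just a placeholder:

import numpy as np
import Pycluster as pc

# project each shape's distance vector onto the first k right singular vectors
U, S, Vt = np.linalg.svd(darray, full_matrices=False)
k = 3                                     # number of components to keep
reduced = U[:, :k] * S[:k]                # shape (n_shapes, k)
clust = pc.kcluster(reduced, nclusters=2)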
I'm not going to accept my answer just yet because I'm interested in any other ideas or thoughts about how to approach this simple problem.