Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Calculate protein contact order in Python [closed]

Tags:

python

numba

Protein contact order (CO) is of locality of inter-residue contacts. The CO is correlated also to the folding speed of proteins. Higher contact orders indicate longer folding times, and low contact order has been suggested as a predictor of potential downhill folding, or protein folding that occurs without a free energy barrier.

There is an inline webserver and a perl script to calculate the CO here.

Is there a way to calculate the CO in python?

like image 522
Ajasja Avatar asked Feb 06 '23 21:02

Ajasja


2 Answers

You can take advantage of the fact that you know your coordinate array will always be (N,3) in shape, so you can get rid of an array creation and calling np.dot, which is less efficient for really small arrays like the ones here. Doing so, I've re-written your function as abs_contact_order2:

from __future__ import print_function, division

import mdtraj as md
import numpy as np
import numba as nb

@nb.njit
def abs_contact_order(xyz, atoms_residue, cutoff_nm=.6):
    """Return the absolute contact order."""
    contact_count = 0
    seq_distance_sum = 0
    cutoff_2 = cutoff_nm*cutoff_nm
    N = len(atoms_residue)
    for i in xrange(N):
        for j in xrange(N):
            seq_dist = atoms_residue[j] - atoms_residue[i]
            if seq_dist > 0:
                d = xyz[j] - xyz[i]
                if np.dot(d, d) < cutoff_2:
                    seq_distance_sum += seq_dist 
                    contact_count += 1


    if contact_count==0.:
        return 0.

    return seq_distance_sum/float(contact_count)

@nb.njit
def abs_contact_order2(xyz, atoms_residue, cutoff_nm=.6):
    """Return the absolute contact order."""
    contact_count = 0
    seq_distance_sum = 0
    cutoff_2 = cutoff_nm*cutoff_nm
    N = len(atoms_residue)
    for i in xrange(N):
        for j in xrange(N):
            seq_dist = atoms_residue[j] - atoms_residue[i]
            if seq_dist > 0:
                d = 0.0
                for k in xrange(3):
                    d += (xyz[j,k] - xyz[i,k])**2
                if d < cutoff_2:
                    seq_distance_sum += seq_dist 
                    contact_count += 1


    if contact_count==0.:
        return 0.

    return seq_distance_sum/float(contact_count)

And then timings:

%timeit abs_co = abs_contact_order(traj.xyz[0], seq_atoms, cutoff_nm=0.60)
1 loop, best of 3: 723 ms per loop

%timeit abs_co = abs_contact_order2(traj.xyz[0], seq_atoms, cutoff_nm=0.60)
10 loops, best of 3: 28.2 ms per loop

I've taken the print statements out of the functions, which allows you to run the Numba jit in nopython mode. If you really need this information, I would return all of the necessary data and the write a thin wrapper that inspects the return values and prints any debug information as needed.

Also note, when timing Numba functions, you should "warm-up" the jit first by calling the function outside of the timing loop. Often if you are going to call the function many times, the time it takes Numba to jit the function is larger than the time for a single call, so you don't really get a good indication of how fast the code is. If however you are only going to call the function once and startup time is important, continue timing it as you are.

Update:

You can speed this up a little further since you're looping over (i,j) and (j,i) pairs and they are symmetric:

@nb.njit
def abs_contact_order3(xyz, atoms_residue, cutoff_nm=.6):
    """Return the absolute contact order."""
    contact_count = 0
    seq_distance_sum = 0
    cutoff_2 = cutoff_nm*cutoff_nm
    N = len(atoms_residue)
    for i in xrange(N):
        for j in xrange(i+1,N):
            seq_dist = atoms_residue[j] - atoms_residue[i]
            if seq_dist > 0:
                d = 0.0
                for k in xrange(3):
                    d += (xyz[j,k] - xyz[i,k])**2
                if d < cutoff_2:
                    seq_distance_sum += 2*seq_dist 
                    contact_count += 2


    if contact_count==0.:
        return 0.

    return seq_distance_sum/float(contact_count)

and timing:

%timeit abs_co = abs_contact_order3(traj.xyz[0], seq_atoms, cutoff_nm=0.60)
100 loops, best of 3: 15.7 ms per loop
like image 128
JoshAdel Avatar answered Feb 20 '23 06:02

JoshAdel


Indeed there is! Using the excellent MDTraj this is not a problem:

import numpy as np
import mdtraj as md

@jit
def abs_contact_order(xyz, atoms_residue, cutoff_nm=.6):
    """Return the absolute contact order."""
    contact_count = 0
    seq_distance_sum = 0
    cutoff_2 = cutoff_nm*cutoff_nm
    N = len(atoms_residue)
    for i in xrange(N):
        for j in xrange(N):
            seq_dist = atoms_residue[j] - atoms_residue[i]
            if seq_dist > 0:
                d = xyz[j] - xyz[i]
                if np.dot(d, d) < cutoff_2:
                    seq_distance_sum += seq_dist 
                    contact_count += 1


    if contact_count==0.:
        print("Warning, no contacts found!")
        return 0.

    print("contact_count: ", contact_count)
    print("seq_distance_sum: ", seq_distance_sum)
    return seq_distance_sum/float(contact_count)

Run using:

traj = md.load("test.pdb")
print("Atoms: ", traj.n_atoms)
seq_atoms = np.array([a.residue.resSeq for a in traj.top.atoms], dtype=int)

abs_co = abs_contact_order(traj.xyz[0], seq_atoms, cutoff_nm=0.60)

print("Absolute Contact Order: ", abs_co)
print("Relative Contact Order: ", abs_co/traj.n_residues)

Without numba this takes 16s and just by adding a single @jit the time is reduced to ~1s.

The original perl script takes about 8s

perl contactOrder.pl -r -a -c 6 test.pdb

The test file as well as an jupyter notebook is here.

like image 42
Ajasja Avatar answered Feb 20 '23 08:02

Ajasja