Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Algorithm optimization to find possible aminoacid sequences with total mass m [duplicate]

This is for a homework assignment, I solved the problem but I'm trying to find a faster solution.

The problem is as follows: I need to figure out how many possible aminoacid (aa) sequences exist that have a total mass m. I have a table of aminoacids (single letter strings) and their corresponding mass (int) which I put in a dictionary.

My initial solution was to create all possible combinations of aa and compare each combination's total mass to the mass m. This works for small numbers of m but the number of combinations gets ridiculously high when m starts being in the hundreds.

I did some minor optimization and got it to work reasonably fast for m < 500 which is good enough for this problem but I want to know how to make it work for higher masses.

This is what I have so far:

totalmass = m

def pepList():
    tempList = ['']
    temp2List = []
    length = 0
    total = 0
    aminoList = 'GASPVTCINDKEMHFRYW'  #this are all the aminoacids

    while length < maxLength:
        for i in tempList:
            for j in aminoList:
                pepMass = peptideMass(i+j, massTable) #find the mass of 
                                                      #this peptide
                if pepMass == totalmass:
                    total += 1
                elif pepMass <= totalmass:
                    temp2List.append(i+j)


        tempList = []
        for i in temp2List:
            tempList.append(i)
        temp2List = []
        length = length + 1

    print (total)

pepList()

I can get a solution for m = 300 in about a second but m = 500 takes about 40 seconds

I tried an alternative using itertools but it wasn't any faster:

total = 0
pepList = []

for i in range(maxLength+1):
    for p in itertools.combinations_with_replacement(aminoList, i): 
    #order matters for the total number of peptides but not for calculating 
    #the total mass
        amino = ''.join(p)
        if peptideMass(amino, massTable) == mass:
            pepList.append(amino)

print (len(pepList))

newpepList = []

for i in pepList:

    for p in itertools.permutations(i, r = len(i)): 
    #I use permutations here to get the total number because order matters
        if p not in newpepList:
            newpepList.append(p)

            total +=1

print (total)

Sample input: m = 270 output: 22

like image 409
Fostire Avatar asked Nov 25 '13 15:11

Fostire


2 Answers

The order in which the amino acids occur does not change the mass - AAC weighs the same as ACA and CAA.

Therefore this could be reduced to a linear programming problem - find the values of the coefficients such that M = a*A + b*C + c*D + d*E + e*G + ... + r*W

Once you have a solution, you can generate all possible permutations of the given set of amino acids - or if you just need the number of permutations, you can calculate it directly.

Edit:

As @Hooked points out, this is not Linear Programming for two reasons: first, we require integer coefficients, and second, we are looking for all combinations, not finding a single optimal solution.

I've worked out a recursive generator as follows:

from math import floor, ceil
import profile

amino_weight = {
    'A':  71.038,
    'C': 103.009,
    'D': 115.027,
    'E': 129.043,
    'F': 147.068,
    'G':  57.021,
    'H': 137.059,
    'I': 113.084,
    'K': 128.095,
    'L': 113.084,   # you omitted leutine?
    'M': 131.040,
    'N': 114.043,
    'P':  97.053,
    'Q': 128.059,   # you omitted glutamine?
    'R': 156.101,
    'S':  87.032,
    'T': 101.048,
    'V':  99.068,
    'W': 186.079,
    'Y': 163.063
}

def get_float(prompt):
    while True:
        try:
            return float(raw_input(prompt))
        except ValueError:
            pass

# This is where the fun happens!
def get_mass_combos(aminos, pos, lo, hi, cutoff):
    this = aminos[pos]         # use a pointer into the string, to avoid copying 8 million partial strings around
    wt = amino_weight[this]
    kmax = int(floor(hi / wt))
    npos = pos - 1
    if npos:                   # more aminos to consider recursively
        for k in xrange(0, kmax + 1):
            mass    = k * wt
            nlo     = lo - mass
            nhi     = hi - mass
            ncutoff = cutoff - mass
            if nlo <= 0. and nhi >= 0.:
                # we found a winner!
                yield {this: k}
            elif ncutoff < 0.:
                # no further solution is possible
                break
            else:
                # recurse
                for cc in get_mass_combos(aminos, npos, nlo, nhi, ncutoff):
                    if k > 0: cc[this] = k
                    yield cc
    else:                      # last amino - it's this or nothing
        kmin = int(ceil(lo / wt))
        for k in xrange(kmin, kmax+1):
            yield {this: k}

def to_string(combo):
    keys = sorted(combo)
    return ''.join(k*combo[k] for k in keys)

def total_mass(combo):
    return sum(amino_weight[a]*n for a,n in combo.items())

def fact(n):
    num = 1
    for i in xrange(2, n+1):
        num *= i
    return num

def permutations(combo):
    num = 0
    div = 1
    for v in combo.values():
        num += v
        div *= fact(v)
    return fact(num) / div

def find_combos(lo, hi):
    total = 0
    bases = []
    aminos = ''.join(sorted(amino_weight, key = lambda x: amino_weight[x]))
    for combo in get_mass_combos(aminos, len(aminos)-1, lo, hi, hi - amino_weight[aminos[0]]):
        base = to_string(combo)
        bases.append(base)
        mass = total_mass(combo)
        cc = permutations(combo)
        total += cc
        print("{} (mass {}, {} permutations)".format(base, mass, cc))
    print('Total: {} bases, {} permutations'.format(len(bases), total))

def main():
    lo = get_float('Bottom of target mass range? ')
    hi = get_float('Top of target mass range? ')

    prof = profile.Profile()
    prof.run('find_combos({}, {})'.format(lo, hi))
    prof.print_stats()

if __name__=="__main__":
    main()

It also looks for a mass range, using floating-points amino masses. Searching for a mass between 748.0 and 752.0 returns 7,505 bases totaling 9,400,528 permutations in 3.82 seconds on my machine (i5-870).

like image 168
Hugh Bothwell Avatar answered Nov 19 '22 08:11

Hugh Bothwell


Inspired by Hugh's answer below: here is a solution using numpy. It always calculates all combinations, therefore it uses a lot of memory, but it has linear complexity.

The idea is to store all possible coeficient arrays in a numpy array (C), and use numpy.dot to generate the sum of masses for each coeficient array, which is then stored in the array PROD.

The problem is then finding which elements of PROD contain the desired mass M, and work back (based on the element index in PROD) the actual PEP_NAMES

It would be interesting to know if it actually produces the correct result:

import sys
import numpy as np

def main():
    try:
        M = int(sys.argv[1])
    except Exception:
        print "Usage: %s M" % sys.argv[0]
        print "    M = total mass"
        sys.exit()

    PEP_NAMES      =          ['G', 'A', 'S', 'P', 'V', 'T', 'C', 'I', 'N', 'D', 'K', 'E', 'M', 'H', 'F', 'R', 'Y', 'W']
    # random data
    PEP_MASSES     = np.array([ 71,  99,  14,  37,  61,  63,  83,   3,  52,  43,   2,  80,  18,  37,  56,  36,  96,  13])
    LEN_PEP_MASSES = len(PEP_MASSES)
    NUM_COMB       = 2**LEN_PEP_MASSES-1

    # numpy array containing all possible coeficients
    C = np.array([[int(x) for x in np.binary_repr(K, width=LEN_PEP_MASSES)] for K in xrange(NUM_COMB)])
    # each element is an array of coefficients representing a number between 0 and NUM_COMB in binary form
    print "type(C)      = %s" % type(C)
    print "type(C[0])   = %s" % type(C[0])
    print "C.shape      = %s" % str(C.shape)
    print "C[0].shape   = %s" % str(C[0].shape)
    print "C[0]         = %s" % C[0]
    print "C[15]        = %s" % C[15]
    print "C[255]       = %s" % C[255]

    # Calculate sum of all combinations
    PROD = C.dot(PEP_MASSES)

    # find the ones that match M
    valid_combinations = [(i,x) for i,x in enumerate(PROD) if x == M]
    print 'Found %d possibilities with total mass = %d:' % (len(valid_combinations), M)
    print valid_combinations
    for comb_index, comb_mass in valid_combinations:
        # work back the combinations in string format
        comb_str = [PEP_NAMES[i] for i,x in enumerate(C[comb_index]) if x==1]
        print '%10d --> %s' % (comb_index, ''.join(comb_str))

if __name__ == '__main__':
    main()

Sample output:

python test.py 750
type(C)      = <type 'numpy.ndarray'>
type(C[0])   = <type 'numpy.ndarray'>
C.shape      = (262143, 18)
C[0].shape   = (18,)
C[0]         = [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
C[15]        = [0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1]
C[255]       = [0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1]
Found 24 possibilities with total mass = 750:
[(130815, 750), (196478, 750), (204671, 750), (208895, 750), (212575, 750), (220155, 750), (221039, 750), (225263, 750), (227455, 750), (228059, 750), (228943, 750), (229151, 750), (236542, 750), (244446, 750), (244695, 750), (252910, 750), (257914, 750), (260062, 750), (260814, 750), (260988, 750), (261022, 750), (261063, 750), (261750, 750), (262109, 750)]
    130815 --> ASPVTCINKEMHFRYW
    196478 --> GSPVTCINDEMHFRY
    204671 --> GATCINDEMHFRYW
    208895 --> GAVCINDKEMHFRYW
    212575 --> GAVTCINEHFRYW
    220155 --> GAPTCNDKEMHFYW
    221039 --> GAPTCINDEMFRYW
    225263 --> GAPVCINDKEMFRYW
    227455 --> GAPVTCEMHFRYW
    228059 --> GAPVTCNKEHFYW
    228943 --> GAPVTCINEFRYW
    229151 --> GAPVTCINDHFRYW
    236542 --> GASTCNDKEMHFRY
    244446 --> GASVTCNKEHFRY
    244695 --> GASVTCNDKEHRYW
    252910 --> GASPTCNDKEMFRY
    257914 --> GASPVCINDEMHFY
    260062 --> GASPVTINDKEHFRY
    260814 --> GASPVTCNKEFRY
    260988 --> GASPVTCNDEMHFR
    261022 --> GASPVTCNDKHFRY
    261063 --> GASPVTCNDKERYW
    261750 --> GASPVTCINEMHRY
    262109 --> GASPVTCINDKEHFRW

It takes about 15 seconds to run on my laptop.

Note that it gives you all combinations (!) (i.e. the order of elements is not important). If you need all permutations, you just need to iterate over each result and generate them.

like image 29
E.Z. Avatar answered Nov 19 '22 10:11

E.Z.