This is for a homework assignment, I solved the problem but I'm trying to find a faster solution.
The problem is as follows: I need to figure out how many possible aminoacid (aa) sequences exist that have a total mass m. I have a table of aminoacids (single letter strings) and their corresponding mass (int) which I put in a dictionary.
My initial solution was to create all possible combinations of aa and compare each combination's total mass to the mass m. This works for small numbers of m but the number of combinations gets ridiculously high when m starts being in the hundreds.
I did some minor optimization and got it to work reasonably fast for m < 500 which is good enough for this problem but I want to know how to make it work for higher masses.
This is what I have so far:
totalmass = m
def pepList():
tempList = ['']
temp2List = []
length = 0
total = 0
aminoList = 'GASPVTCINDKEMHFRYW' #this are all the aminoacids
while length < maxLength:
for i in tempList:
for j in aminoList:
pepMass = peptideMass(i+j, massTable) #find the mass of
#this peptide
if pepMass == totalmass:
total += 1
elif pepMass <= totalmass:
temp2List.append(i+j)
tempList = []
for i in temp2List:
tempList.append(i)
temp2List = []
length = length + 1
print (total)
pepList()
I can get a solution for m = 300 in about a second but m = 500 takes about 40 seconds
I tried an alternative using itertools but it wasn't any faster:
total = 0
pepList = []
for i in range(maxLength+1):
for p in itertools.combinations_with_replacement(aminoList, i):
#order matters for the total number of peptides but not for calculating
#the total mass
amino = ''.join(p)
if peptideMass(amino, massTable) == mass:
pepList.append(amino)
print (len(pepList))
newpepList = []
for i in pepList:
for p in itertools.permutations(i, r = len(i)):
#I use permutations here to get the total number because order matters
if p not in newpepList:
newpepList.append(p)
total +=1
print (total)
Sample input: m = 270 output: 22
The order in which the amino acids occur does not change the mass - AAC weighs the same as ACA and CAA.
Therefore this could be reduced to a linear programming problem - find the values of the coefficients such that M = a*A + b*C + c*D + d*E + e*G + ... + r*W
Once you have a solution, you can generate all possible permutations of the given set of amino acids - or if you just need the number of permutations, you can calculate it directly.
Edit:
As @Hooked points out, this is not Linear Programming for two reasons: first, we require integer coefficients, and second, we are looking for all combinations, not finding a single optimal solution.
I've worked out a recursive generator as follows:
from math import floor, ceil
import profile
amino_weight = {
'A': 71.038,
'C': 103.009,
'D': 115.027,
'E': 129.043,
'F': 147.068,
'G': 57.021,
'H': 137.059,
'I': 113.084,
'K': 128.095,
'L': 113.084, # you omitted leutine?
'M': 131.040,
'N': 114.043,
'P': 97.053,
'Q': 128.059, # you omitted glutamine?
'R': 156.101,
'S': 87.032,
'T': 101.048,
'V': 99.068,
'W': 186.079,
'Y': 163.063
}
def get_float(prompt):
while True:
try:
return float(raw_input(prompt))
except ValueError:
pass
# This is where the fun happens!
def get_mass_combos(aminos, pos, lo, hi, cutoff):
this = aminos[pos] # use a pointer into the string, to avoid copying 8 million partial strings around
wt = amino_weight[this]
kmax = int(floor(hi / wt))
npos = pos - 1
if npos: # more aminos to consider recursively
for k in xrange(0, kmax + 1):
mass = k * wt
nlo = lo - mass
nhi = hi - mass
ncutoff = cutoff - mass
if nlo <= 0. and nhi >= 0.:
# we found a winner!
yield {this: k}
elif ncutoff < 0.:
# no further solution is possible
break
else:
# recurse
for cc in get_mass_combos(aminos, npos, nlo, nhi, ncutoff):
if k > 0: cc[this] = k
yield cc
else: # last amino - it's this or nothing
kmin = int(ceil(lo / wt))
for k in xrange(kmin, kmax+1):
yield {this: k}
def to_string(combo):
keys = sorted(combo)
return ''.join(k*combo[k] for k in keys)
def total_mass(combo):
return sum(amino_weight[a]*n for a,n in combo.items())
def fact(n):
num = 1
for i in xrange(2, n+1):
num *= i
return num
def permutations(combo):
num = 0
div = 1
for v in combo.values():
num += v
div *= fact(v)
return fact(num) / div
def find_combos(lo, hi):
total = 0
bases = []
aminos = ''.join(sorted(amino_weight, key = lambda x: amino_weight[x]))
for combo in get_mass_combos(aminos, len(aminos)-1, lo, hi, hi - amino_weight[aminos[0]]):
base = to_string(combo)
bases.append(base)
mass = total_mass(combo)
cc = permutations(combo)
total += cc
print("{} (mass {}, {} permutations)".format(base, mass, cc))
print('Total: {} bases, {} permutations'.format(len(bases), total))
def main():
lo = get_float('Bottom of target mass range? ')
hi = get_float('Top of target mass range? ')
prof = profile.Profile()
prof.run('find_combos({}, {})'.format(lo, hi))
prof.print_stats()
if __name__=="__main__":
main()
It also looks for a mass range, using floating-points amino masses. Searching for a mass between 748.0 and 752.0 returns 7,505 bases totaling 9,400,528 permutations in 3.82 seconds on my machine (i5-870).
Inspired by Hugh's answer below: here is a solution using numpy
. It always calculates all combinations, therefore it uses a lot of memory, but it has linear complexity.
The idea is to store all possible coeficient arrays in a numpy array (C
), and use numpy.dot
to generate the sum of masses for each coeficient array, which is then stored in the array PROD
.
The problem is then finding which elements of PROD
contain the desired mass M
, and work back (based on the element index in PROD
) the actual PEP_NAMES
It would be interesting to know if it actually produces the correct result:
import sys
import numpy as np
def main():
try:
M = int(sys.argv[1])
except Exception:
print "Usage: %s M" % sys.argv[0]
print " M = total mass"
sys.exit()
PEP_NAMES = ['G', 'A', 'S', 'P', 'V', 'T', 'C', 'I', 'N', 'D', 'K', 'E', 'M', 'H', 'F', 'R', 'Y', 'W']
# random data
PEP_MASSES = np.array([ 71, 99, 14, 37, 61, 63, 83, 3, 52, 43, 2, 80, 18, 37, 56, 36, 96, 13])
LEN_PEP_MASSES = len(PEP_MASSES)
NUM_COMB = 2**LEN_PEP_MASSES-1
# numpy array containing all possible coeficients
C = np.array([[int(x) for x in np.binary_repr(K, width=LEN_PEP_MASSES)] for K in xrange(NUM_COMB)])
# each element is an array of coefficients representing a number between 0 and NUM_COMB in binary form
print "type(C) = %s" % type(C)
print "type(C[0]) = %s" % type(C[0])
print "C.shape = %s" % str(C.shape)
print "C[0].shape = %s" % str(C[0].shape)
print "C[0] = %s" % C[0]
print "C[15] = %s" % C[15]
print "C[255] = %s" % C[255]
# Calculate sum of all combinations
PROD = C.dot(PEP_MASSES)
# find the ones that match M
valid_combinations = [(i,x) for i,x in enumerate(PROD) if x == M]
print 'Found %d possibilities with total mass = %d:' % (len(valid_combinations), M)
print valid_combinations
for comb_index, comb_mass in valid_combinations:
# work back the combinations in string format
comb_str = [PEP_NAMES[i] for i,x in enumerate(C[comb_index]) if x==1]
print '%10d --> %s' % (comb_index, ''.join(comb_str))
if __name__ == '__main__':
main()
Sample output:
python test.py 750
type(C) = <type 'numpy.ndarray'>
type(C[0]) = <type 'numpy.ndarray'>
C.shape = (262143, 18)
C[0].shape = (18,)
C[0] = [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
C[15] = [0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1]
C[255] = [0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1]
Found 24 possibilities with total mass = 750:
[(130815, 750), (196478, 750), (204671, 750), (208895, 750), (212575, 750), (220155, 750), (221039, 750), (225263, 750), (227455, 750), (228059, 750), (228943, 750), (229151, 750), (236542, 750), (244446, 750), (244695, 750), (252910, 750), (257914, 750), (260062, 750), (260814, 750), (260988, 750), (261022, 750), (261063, 750), (261750, 750), (262109, 750)]
130815 --> ASPVTCINKEMHFRYW
196478 --> GSPVTCINDEMHFRY
204671 --> GATCINDEMHFRYW
208895 --> GAVCINDKEMHFRYW
212575 --> GAVTCINEHFRYW
220155 --> GAPTCNDKEMHFYW
221039 --> GAPTCINDEMFRYW
225263 --> GAPVCINDKEMFRYW
227455 --> GAPVTCEMHFRYW
228059 --> GAPVTCNKEHFYW
228943 --> GAPVTCINEFRYW
229151 --> GAPVTCINDHFRYW
236542 --> GASTCNDKEMHFRY
244446 --> GASVTCNKEHFRY
244695 --> GASVTCNDKEHRYW
252910 --> GASPTCNDKEMFRY
257914 --> GASPVCINDEMHFY
260062 --> GASPVTINDKEHFRY
260814 --> GASPVTCNKEFRY
260988 --> GASPVTCNDEMHFR
261022 --> GASPVTCNDKHFRY
261063 --> GASPVTCNDKERYW
261750 --> GASPVTCINEMHRY
262109 --> GASPVTCINDKEHFRW
It takes about 15 seconds to run on my laptop.
Note that it gives you all combinations (!) (i.e. the order of elements is not important). If you need all permutations, you just need to iterate over each result and generate them.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With