I have been trying to code the theoretical tryptic cleavage of protein sequences in Python. The cleavage rule for trypsin is: after R or K, but not before P (i.e. trypsin cuts the protein sequence after each K or R, unless that K or R is followed by a P).
Example: Cleavage (cut) of the sequence MVPPPPSRGGAAKPGQLGRSLGPLLLLLRPEEPEDGDREICSESK
should result in these 4 sequences (peptides):
MVPPPPSR
GGAAKPGQLGR
SLGPLLLLLRPEEPEDGDR
EICSESK
Note that there is no cleavage after K in the second peptide (because P comes after K) and there is no cleavage after R in the third peptide (because P comes after R).
I have written this code in Python, but it doesn't work well. Is there any way to implement this regular expression more meaningfully?
import os
import re

# Open the file and read it line by line.
myprotein = open(raw_input('Enter input filename: '),'r')
if os.path.exists("trypsin_digest.txt"):
    os.remove("trypsin_digest.txt")
outfile = open("trypsin_digest.txt",'w+')
for line in myprotein:
    protein = line.rstrip()
    protein = re.sub('(?<=[RK])(?=[^P])','', protein)
    for peptide in protein:
        outfile.write(peptide)
print 'results written to:\n', os.getcwd() +'\ trypsin_digest.txt'
This is how I got it to work for me
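import re

myprotein = open(raw_input('Enter input filename: '),'r')
my_protein = []
# Strip the newline from each line and collect the pieces.
for protein in myprotein:
    my_protein.append(protein.rstrip('\n'))
# Join all lines into one continuous sequence.
my_pro = ''.join(my_protein)
# Cleave the sequence: insert a newline after each K or R not followed by P.
peptides = re.sub(r'(?<=[RK])(?=[^P])','\n', my_pro)
print peptides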
Protein Sequence:
MVPPPPSRGGAAKPGQLGRSLGPLLLLLRPEEPEDGDREICSESK
MVPPPPSRGGAAKPGQLGRSLGPLLLLLRPEEPEDGDREICSESK
MVPPPPSRGGAAKPGQLGRSLGPLLLLLRPEEPEDGDREICSESK
Output (trypsin-cleaved peptides):
MVPPPPSR
GGAAKPGQLGR
SLGPLLLLLRPEEPEDGDR
EICSESK
MVPPPPSR
GGAAKPGQLGR
SLGPLLLLLRPEEPEDGDR
EICSESK
MVPPPPSR
GGAAKPGQLGR
SLGPLLLLLRPEEPEDGDR
EICSESK
Trypsin cleaves peptides on the C-terminal side of lysine and arginine amino acid residues. If a proline residue is on the carboxyl side of the cleavage site, the cleavage will not occur.
Trypsin cleaves specifically peptide bonds at the C-terminal side of lysine and arginine residues, except for -Arg-Pro- and -Lys-Pro- bonds which are normally resistant to proteolysis.
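The rule quoted above maps fairly directly onto a zero-width regular expression: a position preceded by K or R and not followed by P. The following is only a minimal sketch of that idea; the names `CLEAVAGE_SITE` and `tryptic_peptides` are made up for illustration, and it assumes a single upper-case, one-letter-code sequence with no whitespace. It uses the negative lookahead `(?!P)` rather than `(?=[^P])`; the two behave the same except at the very end of the sequence, where `(?=[^P])` cannot match because there is no following character.

```python
import re

# Zero-width match: just after K or R, provided the next residue is not P.
CLEAVAGE_SITE = re.compile(r'(?<=[KR])(?!P)')

def tryptic_peptides(sequence):
    """Return the peptides of a full tryptic digest of one sequence (illustrative helper)."""
    peptides = []
    start = 0
    for match in CLEAVAGE_SITE.finditer(sequence):
        end = match.start()          # cleavage happens right before this index
        peptides.append(sequence[start:end])
        start = end
    if start < len(sequence):
        peptides.append(sequence[start:])  # tail that does not end in K or R
    return peptides

example = 'MVPPPPSRGGAAKPGQLGRSLGPLLLLLRPEEPEDGDREICSESK'
print(tryptic_peptides(example))
```

Slicing between match positions avoids inserting and then removing a separator character, which may be convenient if the cleavage positions themselves are needed later.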
Trypsin is an enzyme that helps us digest protein. In the small intestine, trypsin breaks down proteins, continuing the process of digestion that began in the stomach. It may also be referred to as a proteolytic enzyme, or proteinase. Trypsin is produced by the pancreas in an inactive form called trypsinogen.
Regexes are nice, but here's a solution that uses plain Python. Since you're extracting subsequences, it makes sense to build this as a generator that yields the fragments.
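example = 'MVPPPPSRGGAAKPGQLGRSLGPLLLLLRPEEPEDGDREICSESK'

def trypsin(bases):
    sub = ''
    while bases:
        k, r = bases.find('K'), bases.find('R')
        # find() returns -1 for "not found"; index 0 is a valid hit.
        if k >= 0 and r >= 0:
            cut = min(k, r) + 1
        elif k >= 0 or r >= 0:
            cut = max(k, r) + 1
        else:
            cut = len(bases)  # no K or R left: the rest is the last fragment
        sub += bases[:cut]
        bases = bases[cut:]
        # Only emit the fragment if the next residue is not P.
        if not bases or bases[0] != 'P':
            yield sub
            sub = ''

print(list(trypsin(example)))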
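The generator above calls `find` and re-slices the string on every iteration. If that feels heavy, a single left-to-right scan is one possible alternative. This is just a sketch under the same assumptions (one sequence, one-letter codes); `trypsin_scan` is a hypothetical name, not something from the question.

```python
def trypsin_scan(sequence):
    """Yield tryptic peptides by scanning the sequence once (illustrative sketch)."""
    start = 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1:i + 2]  # '' when i is the last index
        # Cut after K or R unless the following residue is P.
        if residue in 'KR' and next_residue != 'P':
            yield sequence[start:i + 1]
            start = i + 1
    if start < len(sequence):
        yield sequence[start:]  # trailing fragment that does not end in K or R

example = 'MVPPPPSRGGAAKPGQLGRSLGPLLLLLRPEEPEDGDREICSESK'
print(list(trypsin_scan(example)))
```

Sequences that start with K or R, or that do not end in K or R, are worth trying with any implementation, since those are the cases where index handling tends to go wrong.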
EDIT: With a slight modification, your regex works well.
In your comment you mentioned you have multiple sequences in a file (let's call it sequences.dat):
$ cat sequences.dat
MVPPPPSRGGAAKPGQLGRSLGPLLLLLRPEEPEDGDREICSESK
MVPPPPSRGGAAKPGQLGRSLGPLLLLLRPEEPEDGDREICSESK
MVPPPPSRGGAAKPGQLGRSLGPLLLLLRPEEPEDGDREICSESK
>>> import re
>>> with open('sequences.dat') as f:
...     s = f.read()
...
>>> print(s)
MVPPPPSRGGAAKPGQLGRSLGPLLLLLRPEEPEDGDREICSESK
MVPPPPSRGGAAKPGQLGRSLGPLLLLLRPEEPEDGDREICSESK
MVPPPPSRGGAAKPGQLGRSLGPLLLLLRPEEPEDGDREICSESK
>>> # Pass flags by keyword if you need them: the fourth positional argument of re.sub is count.
>>> protein = re.sub(r'(?<=[RK])(?=[^P])', '\n', s)
>>> protein.split()
['MVPPPPSR', 'GGAAKPGQLGR', 'SLGPLLLLLRPEEPEDGDR', 'EICSESK', 'MVPPPPSR', 'GGAAKPGQLGR', 'SLGPLLLLLRPEEPEDGDR', 'EICSESK', 'MVPPPPSR', 'GGAAKPGQLGR', 'SLGPLLLLLRPEEPEDGDR', 'EICSESK']
>>> print(protein)
MVPPPPSR
GGAAKPGQLGR
SLGPLLLLLRPEEPEDGDR
EICSESK
MVPPPPSR
GGAAKPGQLGR
SLGPLLLLLRPEEPEDGDR
EICSESK
MVPPPPSR
GGAAKPGQLGR
SLGPLLLLLRPEEPEDGDR
EICSESK
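To get from the interactive session to what the question originally asked for (read sequences from a file, write the peptides to another file), something along these lines might do. It is a sketch only: `digest_file` is a hypothetical helper, and it assumes one sequence per line with no FASTA headers. Each line is digested on its own, so a peptide never spans two proteins.

```python
import re

def digest_file(in_path, out_path):
    """Write one tryptic peptide per line; return how many were written (illustrative helper)."""
    count = 0
    with open(in_path) as infile, open(out_path, 'w') as outfile:
        for line in infile:
            sequence = line.strip()
            if not sequence:
                continue  # skip blank lines
            # Insert a newline at each cleavage site, then split into peptides.
            digested = re.sub(r'(?<=[RK])(?=[^P])', '\n', sequence)
            for peptide in digested.split():
                outfile.write(peptide + '\n')
                count += 1
    return count
```

For example, `digest_file('sequences.dat', 'trypsin_digest.txt')` would write the twelve peptides shown above, one per line. Opening the output with mode `'w'` truncates any existing file, so a separate `os.remove` step should not be needed.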