
Trypsin digest (cleavage) does not work using regular expression

Tags:

python

I am trying to code the theoretical tryptic cleavage of protein sequences in Python. The cleavage rule for trypsin is: after R or K, but not before P (i.e. trypsin cuts the protein sequence after each K or R, unless that K or R is followed by a P).

Example: Cleavage (cut) of the sequence MVPPPPSRGGAAKPGQLGRSLGPLLLLLRPEEPEDGDREICSESK should result in these 4 sequences (peptides):

MVPPPPSR
GGAAKPGQLGR
SLGPLLLLLRPEEPEDGDR
EICSESK 

Note that there is no cleavage after K in the second peptide (because P comes after K) and there is no cleavage after R in the third peptide (because P comes after R).

I have written this code in Python, but it doesn't work well. Is there any way to implement this regular expression more meaningfully?

    # Open the file and read it line by line.

    myprotein = open(raw_input('Enter input filename: '),'r')
    if  os.path.exists("trypsin_digest.txt"):
        os.remove("trypsin_digest.txt")
    outfile = open("trypsin_digest.txt",'w+')

    for line in myprotein:
        protein = line.rstrip()
        protein = re.sub('(?<=[RK])(?=[^P])','', protein)

    for peptide in protein:
        outfile.write(peptide)
    print 'results written to:\n', os.getcwd() +'\ trypsin_digest.txt'

This is how I got it to work for me:

    import re

    # Read the file and join all lines into one continuous sequence.
    infile = open(raw_input('Enter input filename: '), 'r')
    lines = []
    for line in infile:
        lines.append(line.rstrip('\n'))
    my_pro = ''.join(lines)

    # Cleave the sequence: insert a newline after each K or R
    # that is followed by a residue other than P.
    peptides = re.sub(r'(?<=[RK])(?=[^P])', '\n', my_pro)
    print peptides

Protein Sequence:

MVPPPPSRGGAAKPGQLGRSLGPLLLLLRPEEPEDGDREICSESK
MVPPPPSRGGAAKPGQLGRSLGPLLLLLRPEEPEDGDREICSESK
MVPPPPSRGGAAKPGQLGRSLGPLLLLLRPEEPEDGDREICSESK

Output (peptides produced by the trypsin cleavage):

MVPPPPSR
GGAAKPGQLGR
SLGPLLLLLRPEEPEDGDR
EICSESK
MVPPPPSR
GGAAKPGQLGR
SLGPLLLLLRPEEPEDGDR
EICSESK
MVPPPPSR
GGAAKPGQLGR
SLGPLLLLLRPEEPEDGDR
EICSESK

asked May 29 '11 15:05 by user587646


2 Answers

Regexes are nice, but here's a solution that uses plain Python. Since you're looking for subsequences of the protein sequence, it makes sense to write this as a generator that yields the fragments.

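example = 'MVPPPPSRGGAAKPGQLGRSLGPLLLLLRPEEPEDGDREICSESK'

def trypsin(bases):
    sub = ''
    while bases:
        # str.find returns -1 when the residue is absent
        k, r = bases.find('K'), bases.find('R')
        if k < 0 and r < 0:
            # no cleavage site left: the remainder is the last fragment
            yield sub + bases
            return
        # cut just after the first K or R, whichever comes first
        cut = min(i for i in (k, r) if i >= 0) + 1
        sub += bases[:cut]
        bases = bases[cut:]
        # emit the fragment unless the next residue is P (no cleavage there)
        if not bases or bases[0] != 'P':
            yield sub
            sub = ''


print list(trypsin(example))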
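
As a cross-check, here is a minimal single-pass sketch of the same rule. It is not the code above, and the name `digest` is only illustrative. It walks the sequence once, cuts after each K or R whose next residue is not P, and emits any trailing residues as a final peptide, so a sequence that does not end in K or R is still returned in full.

```python
def digest(sequence):
    """Yield tryptic peptides: cut after K or R unless the next residue is P.

    `digest` is a hypothetical helper name used only for this sketch.
    """
    start = 0
    last = len(sequence) - 1
    for i, residue in enumerate(sequence):
        # A cut needs a K or R that is not the final residue
        # and is not immediately followed by P.
        if residue in 'KR' and i < last and sequence[i + 1] != 'P':
            yield sequence[start:i + 1]
            start = i + 1
    # Whatever is left after the last cut is the final peptide.
    if start < len(sequence):
        yield sequence[start:]


example = 'MVPPPPSRGGAAKPGQLGRSLGPLLLLLRPEEPEDGDREICSESK'
print(list(digest(example)))
# ['MVPPPPSR', 'GGAAKPGQLGR', 'SLGPLLLLLRPEEPEDGDR', 'EICSESK']
```

Because it never searches the remainder repeatedly, this version does a single left-to-right scan regardless of how many cleavage sites the sequence has.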
answered Oct 15 '22 18:10 by SingleNegationElimination


EDIT: With a slight modification, your regex works well.

In your comment you mentioned you have multiple sequences in a file (let's call it sequences.dat):

$ cat sequences.dat
MVPPPPSRGGAAKPGQLGRSLGPLLLLLRPEEPEDGDREICSESK
MVPPPPSRGGAAKPGQLGRSLGPLLLLLRPEEPEDGDREICSESK
MVPPPPSRGGAAKPGQLGRSLGPLLLLLRPEEPEDGDREICSESK

>>> with open('sequences.dat') as f:
...     s = f.read()
...

>>> print(s)
MVPPPPSRGGAAKPGQLGRSLGPLLLLLRPEEPEDGDREICSESK
MVPPPPSRGGAAKPGQLGRSLGPLLLLLRPEEPEDGDREICSESK
MVPPPPSRGGAAKPGQLGRSLGPLLLLLRPEEPEDGDREICSESK

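>>> import re
>>> # No flags are needed: the pattern contains no '.', and a fourth
>>> # positional argument to re.sub would be interpreted as count, not flags.
>>> protein = re.sub(r'(?<=[RK])(?=[^P])', '\n', s)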

>>> protein.split()
['MVPPPPSR', 'GGAAKPGQLGR', 'SLGPLLLLLRPEEPEDGDR', 'EICSESK', 'MVPPPPSR', 'GGAAKPGQLGR', 'SLGPLLLLLRPEEPEDGDR', 'EICSESK', 'MVPPPPSR', 'GGAAKPGQLGR', 'SLGPLLLLLRPEEPEDGDR', 'EICSESK']

>>> print protein
MVPPPPSR
GGAAKPGQLGR
SLGPLLLLLRPEEPEDGDR
EICSESK

MVPPPPSR
GGAAKPGQLGR
SLGPLLLLLRPEEPEDGDR
EICSESK

MVPPPPSR
GGAAKPGQLGR
SLGPLLLLLRPEEPEDGDR
EICSESK
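
The same zero-width substitution can also be applied line by line and written to a file, which is what the code in the question set out to do. The following is only a sketch, assuming one sequence per line in the input file; the names `CUT_SITE`, `cleave` and `digest_file` are made up for illustration. It uses the negative lookahead `(?!P)` rather than `(?=[^P])`, so a cut point is also recognised at the very end of a sequence, and `split()` discards the empty piece that leaves behind.

```python
import re

# Zero-width cut point: preceded by K or R, and not followed by P.
# (Names in this sketch are illustrative, not from the posts above.)
CUT_SITE = re.compile(r'(?<=[RK])(?!P)')


def cleave(sequence):
    """Return the tryptic peptides of one sequence as a list."""
    # Mark every cut point with a newline, then split on whitespace;
    # split() also drops the empty piece after a cut at the end.
    return CUT_SITE.sub('\n', sequence).split()


def digest_file(in_path, out_path):
    """Digest each non-empty line of in_path; write one peptide per line."""
    with open(in_path) as infile, open(out_path, 'w') as outfile:
        for line in infile:
            sequence = line.strip()
            if sequence:
                for peptide in cleave(sequence):
                    outfile.write(peptide + '\n')
```

For example, `digest_file('sequences.dat', 'trypsin_digest.txt')` would write the twelve peptides shown above, one per line and without the blank lines between sequences.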
answered Oct 15 '22 19:10 by AnalyticsBuilder