Trypsin digest (cleavage) does not work using regular expression

Tags:

python

I am trying to code the theoretical tryptic cleavage of protein sequences in Python. The cleavage rule for trypsin is: after R or K, but not before P (i.e. trypsin cuts the protein sequence after each K or R, unless that K or R is followed by a P).

Example: Cleavage (cut) of the sequence MVPPPPSRGGAAKPGQLGRSLGPLLLLLRPEEPEDGDREICSESK should result in these 4 sequences (peptides):

MVPPPPSR
GGAAKPGQLGR
SLGPLLLLLRPEEPEDGDR
EICSESK

Note that there is no cleavage after K in the second peptide (because P comes after K) and there is no cleavage after R in the third peptide (because P comes after R).

I have written this code in Python, but it doesn't work well. Is there any way to implement this regular expression more meaningfully?

    # Open the file and read it line by line.

    myprotein = open(raw_input('Enter input filename: '),'r')
    if  os.path.exists("trypsin_digest.txt"):
        os.remove("trypsin_digest.txt")
    outfile = open("trypsin_digest.txt",'w+')

    for line in myprotein:
        protein = line.rstrip()
        protein = re.sub('(?<=[RK])(?=[^P])','', protein)

    for peptide in protein:
        outfile.write(peptide)
    print 'results written to:\n', os.getcwd() +'\ trypsin_digest.txt'

This is how I got it to work for me:

    import re

    myprotein = open(raw_input('Enter input filename: '), 'r')
    my_protein = []

    # Strip the newline from every line and join the lines into one string
    for line in myprotein:
        my_protein.append(line.rstrip('\n'))
    my_pro = ''.join(my_protein)

    # Cleave the sequence: insert a newline after each K or R not followed by P
    peptides = re.sub(r'(?<=[RK])(?=[^P])', '\n', my_pro)
    print peptides

Protein Sequence:

MVPPPPSRGGAAKPGQLGRSLGPLLLLLRPEEPEDGDREICSESK
MVPPPPSRGGAAKPGQLGRSLGPLLLLLRPEEPEDGDREICSESK
MVPPPPSRGGAAKPGQLGRSLGPLLLLLRPEEPEDGDREICSESK

Output (the peptides produced by trypsin cleavage):

MVPPPPSR
GGAAKPGQLGR
SLGPLLLLLRPEEPEDGDR
EICSESK
MVPPPPSR
GGAAKPGQLGR
SLGPLLLLLRPEEPEDGDR
EICSESK
MVPPPPSR
GGAAKPGQLGR
SLGPLLLLLRPEEPEDGDR
EICSESK

asked May 29 '11 15:05

user587646

2 Answers

Regexes are nice, but here's a solution that uses plain Python. Since you're looking for subsequences, it makes sense to build this as a generator that yields the fragments.

example = 'MVPPPPSRGGAAKPGQLGRSLGPLLLLLRPEEPEDGDREICSESK'

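def trypsin(bases):
    sub = ''
    while bases:
        # Positions of the next K and R; str.find returns -1 when absent.
        hits = [i for i in (bases.find('K'), bases.find('R')) if i >= 0]
        # With no K or R left, the rest of the sequence is the last fragment.
        cut = min(hits) + 1 if hits else len(bases)
        sub += bases[:cut]
        bases = bases[cut:]
        # Only emit the fragment if the cut is not directly before a P.
        if not bases or bases[0] != 'P':
            yield sub
            sub = ''


print(list(trypsin(example)))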
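
The same rule can also be applied in a single left-to-right scan, which avoids re-slicing the remaining sequence on every cut. This is only a sketch of that idea, not part of the original answer; the name `trypsin_single_pass` is made up for illustration.

```python
def trypsin_single_pass(sequence):
    """Yield tryptic peptides from `sequence` in one left-to-right scan."""
    start = 0
    for i, residue in enumerate(sequence):
        # Residue after the current one, or '' at the end of the sequence.
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ''
        # Cleave after K or R unless the following residue is P.
        if residue in 'KR' and next_residue != 'P':
            yield sequence[start:i + 1]
            start = i + 1
    # Anything left after the last cleavage site is the final peptide.
    if start < len(sequence):
        yield sequence[start:]


example = 'MVPPPPSRGGAAKPGQLGRSLGPLLLLLRPEEPEDGDREICSESK'
print(list(trypsin_single_pass(example)))
```

A sequence that does not end in K or R still yields its trailing residues as the last peptide.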

answered Oct 15 '22 18:10

SingleNegationElimination

EDIT: With a slight modification, your regex works well.

In your comment you mentioned you have multiple sequences in a file (let's call it sequences.dat):

$ cat sequences.dat
MVPPPPSRGGAAKPGQLGRSLGPLLLLLRPEEPEDGDREICSESK
MVPPPPSRGGAAKPGQLGRSLGPLLLLLRPEEPEDGDREICSESK
MVPPPPSRGGAAKPGQLGRSLGPLLLLLRPEEPEDGDREICSESK

>>> import re
>>> with open('sequences.dat') as f:
...     s = f.read()
...

>>> print(s)
MVPPPPSRGGAAKPGQLGRSLGPLLLLLRPEEPEDGDREICSESK
MVPPPPSRGGAAKPGQLGRSLGPLLLLLRPEEPEDGDREICSESK
MVPPPPSRGGAAKPGQLGRSLGPLLLLLRPEEPEDGDREICSESK

>>> protein = re.sub(r'(?<=[RK])(?=[^P])', '\n', s)

>>> protein.split()
['MVPPPPSR', 'GGAAKPGQLGR', 'SLGPLLLLLRPEEPEDGDR', 'EICSESK', 'MVPPPPSR', 'GGAAKPGQLGR', 'SLGPLLLLLRPEEPEDGDR', 'EICSESK', 'MVPPPPSR', 'GGAAKPGQLGR', 'SLGPLLLLLRPEEPEDGDR', 'EICSESK']

>>> print(protein)
MVPPPPSR
GGAAKPGQLGR
SLGPLLLLLRPEEPEDGDR
EICSESK

MVPPPPSR
GGAAKPGQLGR
SLGPLLLLLRPEEPEDGDR
EICSESK

MVPPPPSR
GGAAKPGQLGR
SLGPLLLLLRPEEPEDGDR
EICSESK
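
The blank lines in the output above appear because `[^P]` also matches the newline that follows the last K of each sequence. One way around that, sketched here under the assumption of Python 3.7 or later (where `re.split` can split on an empty match), is to digest each line separately and use the negative lookahead `(?!P)`, which also succeeds at the end of a sequence. The names `CLEAVAGE_SITE` and `digest` are hypothetical.

```python
import re

# Zero-width cleavage site: preceded by K or R and not followed by P.
CLEAVAGE_SITE = re.compile(r'(?<=[RK])(?!P)')


def digest(sequence):
    """Return the tryptic peptides of one protein sequence as a list."""
    # Splitting on an empty match requires Python 3.7+ (assumption).
    pieces = CLEAVAGE_SITE.split(sequence.strip())
    # A site at the very end of the sequence leaves an empty piece; drop it.
    return [piece for piece in pieces if piece]


# Stand-in for the lines of sequences.dat, one sequence per line.
lines = ['MVPPPPSRGGAAKPGQLGRSLGPLLLLLRPEEPEDGDREICSESK\n'] * 3
for line in lines:
    print('\n'.join(digest(line)))
```

Working per line keeps the peptides of different proteins apart, so the split never sees a newline and no blank lines need to be filtered out afterwards.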

answered Oct 15 '22 19:10

AnalyticsBuilder
