Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Using Python regular expressions to split on non-overlapping character groups (ORF finding)

Tags:

python

regex

I am working on finding a pretty and pythonic way to find open reading frames in a DNA sequence. I have found many implementations online that make use of indexing, flags and other such ugliness.

I am pretty sure that a regular expression implementation can be created, but I am bad with regex. The general idea is that I want to split a string of DNA sequence by 'ATG', 'TAG', 'TGA' and 'TAA'. But I do not want to split on overlapping regions, for example the sequence 'ATGA' should be split into 'ATG','A'. Basically go from left to right in one of each of the three frames.

edit for clarity: As said in the comments, a sequence such as ATGATTTTGA should be split into ATG, TTT, TGA despite the presence of TGA (which is in the non-zero frame)

edit2: this is how I have implemented it without regular expressions using the list comprehension splitting linked. I hate the use of flags though.

def find_orf(seq):
    length = 0
    stop = ['TAA','TGA','TAG']
    for frame in range(3):
        orfFlag, thisLen = None, 0
        splitSeq = [seq[start+frame:start+frame+3] for start in range(0,len(seq),3)]
        for codon in splitSeq:
            if codon == 'ATG':
                orfFlag = True
                thisLen += 1
            elif orfFlag and codon in stop:
                orfFlag = None
                if thisLen > length:
                    length = thisLen
            else:
                thisLen += 1
    return length
like image 552
Ian Fiddes Avatar asked Nov 02 '22 11:11

Ian Fiddes


1 Answers

I am not sure your suggested regex method is particularly pythonic, but the basic regex:

import re
v=re.compile("((ATG)|(TGA)|(TAG)|(TAA))")
test="CCATGACCCATGCACCATTGAC"
for i in v.findall(test):
   print i

Does miss the first TGA that is part of ATGA and only reports the second. In general though this won't work as you will have to assume the frame of your gene, which likely is not known ahead of time.

A very readable way would be to simply do list comprehensions over all three reading frames.

like image 82
evolvedmicrobe Avatar answered Nov 15 '22 04:11

evolvedmicrobe