I am working on finding a pretty and pythonic way to find open reading frames in a DNA sequence. I have found many implementations online that make use of indexing, flags and other such ugliness.
I am pretty sure that a regular expression implementation can be created, but I am bad with regex. The general idea is that I want to split a string of DNA sequence by 'ATG', 'TAG', 'TGA' and 'TAA'. But I do not want to split on overlapping regions, for example the sequence 'ATGA' should be split into 'ATG','A'. Basically go from left to right in one of each of the three frames.
edit for clarity: As said in the comments, a sequence such as ATGATTTTGA
should be split into ATG
, TTT
, TGA
despite the presence of TGA
(which is in the non-zero frame)
edit2: this is how I have implemented it without regular expressions using the list comprehension splitting linked. I hate the use of flags though.
def find_orf(seq):
length = 0
stop = ['TAA','TGA','TAG']
for frame in range(3):
orfFlag, thisLen = None, 0
splitSeq = [seq[start+frame:start+frame+3] for start in range(0,len(seq),3)]
for codon in splitSeq:
if codon == 'ATG':
orfFlag = True
thisLen += 1
elif orfFlag and codon in stop:
orfFlag = None
if thisLen > length:
length = thisLen
else:
thisLen += 1
return length
I am not sure your suggested regex method is particularly pythonic, but the basic regex:
import re
v=re.compile("((ATG)|(TGA)|(TAG)|(TAA))")
test="CCATGACCCATGCACCATTGAC"
for i in v.findall(test):
print i
Does miss the first TGA that is part of ATGA and only reports the second. In general though this won't work as you will have to assume the frame of your gene, which likely is not known ahead of time.
A very readable way would be to simply do list comprehensions over all three reading frames.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With