In eukaryotes spliced mRNA has three key properties:
So basically, an mRNA sequence should start with ATG, be followed by any number of As, Cs, Ts or Gs, then TAA or TAG or TGA, then 5 or more As.
My (python) regex is this: ^ATG[ATCG]*T(AA|AG|GA)A{5}A*$
However, this is matching sequences which have further characters after the poly(A) tail, as if the $ character is not being recognized. What am I doing wrong?
Valid Examples:
ATGCTGATGATGATGATAGAAAAA
ATGTGAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
Invalid Examples:
ATGCTGATGXTGATGATAGAAAAA
TATGCTGATGXTGATGATAGAAAAA
ATGTGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAC
EDIT (My full code):
import re
from Bio import SeqIO

file = open('potential_mRNA.fasta')
alignment = SeqIO.parse(file, 'fasta')
mRNA_seqs = []
mRNA_pattern = r'^ATG[ATCG]*T(AA|AG|GA)A{5}A*$'
# Keep only the records whose sequence matches the mRNA pattern
for mrna in alignment:
    sequence = str(mrna.seq)
    if re.search(mRNA_pattern, sequence):
        mRNA_seqs.append(sequence)
The first * is greedy, so [ATCG]* initially consumes as much of the sequence as it can, including the stop codon and the poly(A) tail. The regex engine then backtracks, giving characters back until the rest of the pattern can match.
The $ should, however, make it work as you would expect, so your regex is perfectly valid for your task. Maybe there are some unknown conditions that I couldn't see from your question.
Try ^ATG[ATCG]*?T(?:AA|AG|GA)A{5,}$
I've used lazy *? instead of *, and also a non-capturing group (?:) and A{5,} instead of A{5}A*, just to optimize.
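As a quick sanity check, here is a minimal sketch that runs both patterns against the example sequences from the question. The names ORIGINAL, SUGGESTED, VALID, INVALID and matches are only illustrative and are not part of the original code. Under these assumptions, both patterns should accept the valid examples and reject the invalid ones, including the one with a trailing C.

```python
import re

# Pattern from the question and the variant suggested in the answer
ORIGINAL = re.compile(r'^ATG[ATCG]*T(AA|AG|GA)A{5}A*$')
SUGGESTED = re.compile(r'^ATG[ATCG]*?T(?:AA|AG|GA)A{5,}$')

# Example sequences copied from the question
VALID = [
    'ATGCTGATGATGATGATAGAAAAA',
    'ATGTGAAAAAAAAAAAAAAAAAAAAAAAAAAAAA',
]
INVALID = [
    'ATGCTGATGXTGATGATAGAAAAA',
    'TATGCTGATGXTGATGATAGAAAAA',
    'ATGTGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAC',
]


def matches(pattern, sequence):
    """Return True if the compiled pattern matches the sequence."""
    return pattern.search(sequence) is not None


for name, pattern in (('original', ORIGINAL), ('suggested', SUGGESTED)):
    for seq in VALID + INVALID:
        print(name, seq, matches(pattern, seq))
```

If the original pattern rejects the trailing-C sequence here but not in the full script, the difference is more likely to come from the input data or from how the sequences are read than from the regex itself.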