Regex to Match mRNA Sequences

Question

In eukaryotes, spliced mRNA has three key properties:

mRNA starts with a start codon (ATG)
The coding part of the mRNA ends with one of three stop codons (TAA/TAG/TGA)
Immediately after the stop codon there is a 'poly(A) tail'. The poly(A) tail is a run of many adenines (A's) attached to the 3' end of the coding sequence after transcription. In reality there may be hundreds of A's in the poly(A) tail, but the end of the mRNA/cDNA is usually not sequenced in full, so there may be as few as 5 A's following the stop codon.

So basically, an mRNA sequence should start with ATG, be followed by any number of As, Cs, Ts or Gs, then TAA or TAG or TGA, then 5 or more As.

My (Python) regex is this: ^ATG[ATCG]*T(AA|AG|GA)A{5}A*$

However, it matches sequences that have further characters after the poly(A) tail, as if the $ character were not being recognized. What am I doing wrong?

Valid Examples:

ATGCTGATGATGATGATAGAAAAA
ATGTGAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

Invalid Examples:

ATGCTGATGXTGATGATAGAAAAA
TATGCTGATGXTGATGATAGAAAAA
ATGTGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAC

EDIT (My full code):

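import re
from Bio import SeqIO

mRNA_pattern = re.compile(r'^ATG[ATCG]*T(AA|AG|GA)A{5}A*$')
mRNA_seqs = []
# Open the FASTA file in a context manager so it is closed afterwards
with open('potential_mRNA.fasta') as handle:
    for mrna in SeqIO.parse(handle, 'fasta'):
        sequence = str(mrna.seq)
        # Keep only sequences that match the mRNA pattern
        if mRNA_pattern.search(sequence):
            mRNA_seqs.append(sequence)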

necauqua · Accepted Answer

The first * is greedy: [ATCG]* initially consumes as much of the sequence as it can, including the stop codon and the poly(A) tail, and the regex engine then backtracks until the rest of the pattern can match.

Even so, the $ anchor should make the pattern behave as you expect, so your regex looks valid for this task. There may be some other condition at play that I can't see from your question.
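A quick way to test this is to run the pattern from the question against the examples it lists. This is a minimal sketch using only the standard re module; the names ORIGINAL and matches are illustrative and do not come from the question.

```python
import re

# Pattern exactly as posted in the question.
ORIGINAL = re.compile(r'^ATG[ATCG]*T(AA|AG|GA)A{5}A*$')

valid = [
    "ATGCTGATGATGATGATAGAAAAA",
    "ATGTGAAAAAAAAAAAAAAAAAAAAAAAAAAAAA",
]
invalid = [
    "ATGCTGATGXTGATGATAGAAAAA",             # contains an X
    "TATGCTGATGXTGATGATAGAAAAA",            # does not start with ATG
    "ATGTGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAC",  # extra C after the poly(A) tail
]

def matches(seq):
    """Return True if the question's pattern accepts the sequence."""
    return ORIGINAL.search(seq) is not None

for seq in valid + invalid:
    print(matches(seq), seq)
```

On these plain strings the pattern accepts both valid examples and rejects all three invalid ones, so if unexpected sequences get through, the cause is more likely in how they are read or passed to the regex than in the pattern itself.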

Try ^ATG[ATCG]*?T(?:AA|AG|GA)A{5,}$

I've used the lazy *? instead of *, a non-capturing group (?:...), and A{5,} instead of A{5}A*, just to optimize.
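The suggested pattern could be applied as in the sketch below. It assumes re.fullmatch is used in place of the ^ and $ anchors, which is a swap on my part and not something the question or the answer above specifies; MRNA and is_mrna are hypothetical names. One known subtlety makes fullmatch slightly stricter: in Python, $ also matches just before a newline at the very end of the string, whereas fullmatch requires the whole string to match. This is not necessarily the cause of the behaviour described in the question.

```python
import re

# Suggested pattern without ^ and $: re.fullmatch already requires
# the entire string to match (an assumption of this sketch).
MRNA = re.compile(r'ATG[ATCG]*?T(?:AA|AG|GA)A{5,}')

def is_mrna(seq):
    """Return True if the whole string is ATG ... stop codon ... 5+ A's."""
    return MRNA.fullmatch(seq) is not None

print(is_mrna("ATGCTGATGATGATGATAGAAAAA"))             # True
print(is_mrna("ATGTGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAC"))  # False: trailing C
print(is_mrna("ATGTGAAAAAA\n"))                        # False: trailing newline

# For comparison, the $-anchored form accepts a trailing newline.
anchored = re.compile(r'^ATG[ATCG]*?T(?:AA|AG|GA)A{5,}$')
print(anchored.search("ATGTGAAAAAA\n") is not None)    # True
```

If sequences may carry stray whitespace, calling seq.strip() before matching is one simple option.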

Tags: python, regex, jupyter-notebook, bioinformatics

Sian
