
Regex to Match mRNA Sequences

In eukaryotes, spliced mRNA has three key properties:

  1. mRNA starts with a start codon (ATG)
  2. The coding part of the mRNA ends with one of three stop codons (TAA/TAG/TGA)
  3. Immediately after the stop codon there is a 'poly(A) tail'. The poly(A) tail is a run of many adenines (A's) attached to the 3' end of the coding sequence after transcription. In reality there may be hundreds of A's in the poly(A) tail, but usually the end of the mRNA/cDNA is not entirely sequenced, so there may be as few as 5 A's following the stop codon.

So basically, an mRNA sequence should start with ATG, be followed by any number of As, Cs, Ts or Gs, then TAA or TAG or TGA, then 5 or more As.

My (python) regex is this: ^ATG[ATCG]*T(AA|AG|GA)A{5}A*$

However, this is matching sequences that have further characters after the poly(A) tail, as if the $ anchor were not being recognized. What am I doing wrong?

Valid Examples:

ATGCTGATGATGATGATAGAAAAA
ATGTGAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

Invalid Examples:

ATGCTGATGXTGATGATAGAAAAA
TATGCTGATGXTGATGATAGAAAAA
ATGTGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAC

EDIT (My full code):

import re
from Bio import SeqIO

mRNA_pattern = r'^ATG[ATCG]*T(AA|AG|GA)A{5}A*$'
mRNA_seqs = []
# Parse the FASTA file and keep every sequence that matches the pattern
with open('potential_mRNA.fasta') as file:
    for mrna in SeqIO.parse(file, 'fasta'):
        sequence = str(mrna.seq)
        if re.search(mRNA_pattern, sequence):
            mRNA_seqs.append(sequence)
Sian asked Feb 28 '19 17:02



1 Answer

The first * is greedy: [ATCG]* initially consumes as much of the string as it can, including the stop codon and the poly(A) tail, and the regex engine then backtracks until the rest of the pattern fits.

The $ should, however, make it work as you would expect, so your regex is valid for your task. Maybe there are some conditions that I couldn't see from your question.
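
One way to check this is to run the original pattern against the examples from the question. This is only a minimal sketch using the standard re module; the variable names are mine, and the trailing-newline case at the end is a guess at one possible cause, not something I can confirm from your question.

```python
import re

# Original pattern from the question, unchanged
pattern = re.compile(r'^ATG[ATCG]*T(AA|AG|GA)A{5}A*$')

valid = [
    'ATGCTGATGATGATGATAGAAAAA',
    'ATGTGAAAAAAAAAAAAAAAAAAAAAAAAAAAAA',
]
invalid = [
    'ATGCTGATGXTGATGATAGAAAAA',
    'TATGCTGATGXTGATGATAGAAAAA',
    'ATGTGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAC',
]

valid_results = [bool(pattern.search(s)) for s in valid]
invalid_results = [bool(pattern.search(s)) for s in invalid]
print(valid_results)    # [True, True]
print(invalid_results)  # [False, False, False]

# Assumption: one input that can slip past $ is a string ending in a newline,
# because $ also matches just before a final '\n'. fullmatch() does not allow that.
with_newline = valid[0] + '\n'
search_ok = bool(pattern.search(with_newline))
full_ok = bool(pattern.fullmatch(with_newline))
print(search_ok, full_ok)  # True False
```

If the pattern rejects the invalid examples like this on your machine too, the extra characters are probably coming from somewhere other than the regex itself.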

Try ^ATG[ATCG]*?T(?:AA|AG|GA)A{5,}$

I've used the lazy *? instead of *, a non-capturing group (?:...), and A{5,} instead of A{5}A*, purely as optimizations.
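
To illustrate that these changes are cosmetic, here is a small sketch comparing both patterns on the examples from the question. The names original, suggested and examples are just for illustration; the groups attribute reports how many capturing groups a compiled pattern has.

```python
import re

original = re.compile(r'^ATG[ATCG]*T(AA|AG|GA)A{5}A*$')
suggested = re.compile(r'^ATG[ATCG]*?T(?:AA|AG|GA)A{5,}$')

examples = [
    'ATGCTGATGATGATGATAGAAAAA',
    'ATGTGAAAAAAAAAAAAAAAAAAAAAAAAAAAAA',
    'ATGCTGATGXTGATGATAGAAAAA',
    'TATGCTGATGXTGATGATAGAAAAA',
    'ATGTGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAC',
]

# Both patterns should accept and reject exactly the same examples
agree = all(
    bool(original.search(s)) == bool(suggested.search(s)) for s in examples
)
print(agree)  # True

# The non-capturing group means the suggested pattern stores no groups
print(original.groups, suggested.groups)  # 1 0
```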

necauqua answered Oct 14 '22 17:10
