 

regex: matching a repeating sequence

Tags: python, regex

I'm trying to construct a regular expression that will match a repeating DNA sequence of 2 characters. These characters can be the same.

The regex should match a 2-character sequence that is repeated at least 3 times. Here are some examples:

regex should match on:

  • ATATAT
  • GAGAGAGA
  • CCCCCC

and should not match on:

  • ACAC
  • ACGTACGT

So far I've come up with the following regular expressions:

[ACGT]{2}

This matches any sequence of exactly two characters (A, C, G or T). Now I want to repeat this pattern at least three times, so I tried the following regular expressions:

[ACGT]{2}{3,}
([ACGT]{2}){3,}

Unfortunately, the first one raises a 'multiple repeat' error (Python), while the second one simply matches any run of 6 or more characters consisting of A, C, G and T, even when the pairs differ from each other.
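Both behaviours can be reproduced with a short sketch using Python's standard `re` module. The variable names `first_error` and `second` are only illustrative, and the exact wording of the error message may vary between Python versions:

```python
import re

# First attempt: a quantifier applied directly to another quantifier.
# Python's re module rejects this when the pattern is compiled.
try:
    re.compile(r"[ACGT]{2}{3,}")
    first_error = None
except re.error as exc:
    first_error = str(exc)

# Second attempt: the group is matched afresh on every repetition,
# so each pair is free to differ from the previous one.
second = re.compile(r"([ACGT]{2}){3,}")

print(first_error)                         # the 'multiple repeat' message
print(bool(second.fullmatch("ATATAT")))    # wanted match
print(bool(second.fullmatch("ACGTAC")))    # unwanted match: pairs differ
```

The last line shows the core problem: repeating a group repeats the pattern, not the text that the group matched.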

Is there anyone that can help me out with this regular expression? Thanks in advance.

user2388809 asked Feb 15 '23 09:02



1 Answer

You could perhaps make use of backreferences.

([ATGC]{2})\1{2,}

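\1 is a backreference to the first capture group: it matches exactly the same text that the group captured, not just the same pattern. The expression therefore matches a 2-character unit followed by at least two more copies of that same unit, i.e. three or more repetitions in total.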
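As a quick check against the examples from the question, here is a minimal sketch in Python. The helper name `has_repeat` is hypothetical, and `search` is used on the assumption that the repeat may occur anywhere in the sequence:

```python
import re

# A 2-character unit, followed by at least two more copies of the same unit.
REPEAT = re.compile(r"([ACGT]{2})\1{2,}")

def has_repeat(seq):
    """Return True if seq contains a 2-character unit repeated 3+ times in a row."""
    return REPEAT.search(seq) is not None

for seq in ["ATATAT", "GAGAGAGA", "CCCCCC", "ACAC", "ACGTACGT"]:
    print(seq, has_repeat(seq))
```

The first three sequences print True and the last two print False. If the whole string must consist of the repeat and nothing else, `REPEAT.fullmatch(seq)` can be used in place of `REPEAT.search(seq)`.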


Jerry answered Feb 17 '23 01:02
