Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Match longest repeating sequence (that is not made of a repeating sequence)

Tags:

regex

Consider the string

aabaabaabaabaab

Clearly this string is made of 5 adjacent occurences of aab, so I want my regex to match aab.

To elaborate: aabaab wouldn't be an acceptable output because it's made by repeating aab. But aab is a valid result because it's not made of a repeated shorter string.

For the sake of the question, let us assume that there can be additional text around the repeated segment (for example 11aabaabaabaabaab22 or even xaabaabaabaabaabaa). Therefore it's not possible to anchor the regex with ^ or $.


Failed idea #1: (.+?)\1+ This captures aa instead of the expected aab.

Failed idea #2: (.+)\1+ This captures aabaab.

Is it possible to do this with pure regex? If yes, is it possible without dynamic-width lookbehind?

like image 509
Aran-Fey Avatar asked Jun 09 '17 19:06

Aran-Fey


People also ask

How do you find the longest repeating string?

The maximum value of LCSRe(i, j) provides the length of the longest repeating substring and the substring itself can be found using the length and the ending index of the common suffix.

How do you find the longest repeating substring in a string C++?

Longest Repeating Substring in C++ Suppose we have a string S, we have to find the length of the longest repeating substring(s). We will return 0 if no repeating substring is present. So if the string is like “abbaba”, then the output will be 2. As the longest repeating substring is “ab” or “ba”.


1 Answers

You can use two lookaheads, the first one searches the longest pattern and the second searches the smallest. The second pattern repeated has to end (at least) at the same position or after the first pattern repeated. To check that, you have to capture the end of the string in the first lookahead and to use a backreference to this capture in the second lookahead.

(?=(.+)\1+(.*))(?=(.+?)\3+\2$)\3+

demo

The result is in the group 3

also:

(?=(.+)\1+(.*))(.+?)\3+(?=\2$)\3*

Note that these two regex patterns give the result for one position in a string. If you want to know what is the shortest pattern that is the longest substring once repeated for all the string, you have to find all of them and to select the longest susbstring with code. You can use a pattern for overlapping results to do that:

(?=(.+)\1+(.*))(?=(.+?)\3+\2$)(?=(\3+))

(using the group 4)

like image 100
Casimir et Hippolyte Avatar answered Oct 12 '22 23:10

Casimir et Hippolyte