Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why "ab(cd|c)*d" matches "abcdcdd" completely but "ab(c|cd)*d" does not match that? Whereas they're like each other

Tags:

regex

I tried this regex:

ab(cd|c)*d

in the regex101 and RegExr websites. It matched this text completely:

abcdcdd

Now let's swap "cd" and "c" in the regex:

ab(c|cd)*d

When I try this regex in the websites, I see this regex does not completely match the same text.

Why doesn't the regex engine recognize that ab(cd|c)*d and ab(c|cd)*d are the same, and how can I persuade ab(c|cd)*d to match the longest string?


REGEX: ab(cd|c)*d

Complete text matched in 13 steps: abcdcdd


REGEX: ab(c|cd)*d

Partial text matched in 9 steps: abcdcdd

like image 885
tiki Avatar asked Dec 31 '22 16:12

tiki


1 Answers

@MurrayW's answer is excellent, but I would like to add some background information.

Regex as Finite State Automata

When I first learned regular expressions in university, we learned to convert them to finite state automata, essentially compiling them into graphs that were then processed to match the string. When you do that, (cd|c) and (c|cd) get compiled into the same graph, in which case both of your regular expressions would match the whole string. This is what grep actually does:

Both

echo abcdcdd | grep --color -E 'ab(c|cd)*d'

and

echo abcdcdd | grep --color -E 'ab(cd|c)*d'

color the whole string in red.

Patterns we call "regular expressions"

True finite state automata have many limitations that programmers don't like, such as the inability to capture matching groups, of to reuse those groups later in the pattern, and other limitations I forget, so the regular expression libraries that we use in most programming languages implement more complex formalisms. I don't remember that they are exactly, maybe push-down automata, but we have memory, we have backtracking, and all sorts of good stuff we use without thinking about it.

At the risk of seeming pedantic, the patterns we use are not "regular" at all. I know, the difference is usually not relevant, we just want our code to work, but once in a while it matters.

So, while the regular expressions (cd|c) and (c|cd) would be compiled into the same finite state automaton, those two (non-regular) patterns are instead turned into logic that says try the variants from left to right, and backtrack only if the rest of the pattern fails to match later, hence the results you observed.

Speed

While the patterns our "regular expression" libraries support offer us lots of goodies we like, those come at a performance cost. True regular expressions are blazingly fast, while our patterns, though usually fast, can sometimes be very expensive. Search for "catastrophic backtracking" on this site for many examples of patterns that take exponential time to fail. The same patterns, used with grep, would be compiled into a graph that is applied in linear time to the string to match no matter what.

like image 154
joanis Avatar answered Jan 11 '23 04:01

joanis