Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Tiebreaker for same-length regex alternatives with same starting position

Tags:

regex

sed

gnu-sed

Using GNU sed (with the -r flag for clarity), the following two substitutions on the input string ab give the same result:

s/(.)(.)|(.)(.)$/\2\1\3\4/

and

s/(.)(.)$|(.)(.)/\1\2\4\3/

both give ba. It would appear that the alternative (.)(.) (the one without $) succeeds in both substitutions, regardless of whether its position as the first or second alternative. Why is this the case? What is the tie-breaker for such alternatives?


The POSIX specification of regular expressions specifies1 the tiebreaker for when the alternatives start at different positions (in which case the earlier one is favoured), and when they start at the same position but have different lengths (the longer one is favoured), but it does not appear to specify the behaviour of capturing groups when two alternatives start at the same position and have the same length, thus leaving it to the specific implementation.

The search for a matching sequence starts at the beginning of a string and stops when the first sequence matching the expression is found, where "first" is defined to mean "begins earliest in the string". If the pattern permits a variable number of matching characters and thus there is more than one such sequence starting at that point, the longest such sequence is matched. [...] – The Open Group Base Specifications Issue 7, 2018 edition

Here is a running example of the phenomenon.

echo ab|sed -r 's/(.)(.)|(.)(.)$/\2\1\3\4/'
echo ab|sed -r 's/(.)(.)$|(.)(.)/\1\2\4\3/'

Try it online!

like image 333
user41805 Avatar asked Nov 06 '22 10:11

user41805


1 Answers

Hypothesizing an answer.

Given the input line ab, both (.)(.) and (.)(.)$ will match ab with the same length of 2. So, as you say in your question, the two regexps match with the same length from the same starting point.

However, I would say that, at the moment (.)(.) matches against ab, the engine would have one more check to do (against $) in order to check if (.)(.)$ matches too (i.e. if it's matching at the EOL), in which case the latter regexp would not be preferred anyway, because it has the same length and started at the same point as the former regex which was already matched. So it makes sense to me that the engine returns the references to the groups in (.)(.) without $.

I think this reasoning of mine implies that the matching is greedy against printable characters, but lazy against non-printable ones.

Compare/contrast with this:

echo ab|sed -r 's/^(.)(.)|(.)(.)/\2\1\3\4/'
ba
echo ab|sed -r 's/(.)(.)|^(.)(.)/\2\1\3\4/'
ba

where the leftmost regex matches in both cases.

like image 176
Enlico Avatar answered Nov 12 '22 19:11

Enlico