Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why does this regex run differently in sed than in Perl/Ruby?

Tags:

regex

ruby

sed

perl

I have a regex that gives me one result in sed but another in Perl (and Ruby).

I have the string one;two;;three and I want to highlight the substrings delimited by the ;. So I do the following in Perl:

$a = "one;two;;three";
$a =~ s/([^;]*)/[\1]/g;
print $a;

(Or, in Ruby: print "one;two;;three".gsub(/([^;]*)/, "[\\1]").)

The result is:

[one][];[two][];[];[three][]

(I know the reason for the spurious empty substrings.)

Curiously, when I run the same regexp in sed I get a different result. I run:

echo "one;two;;three" | sed -e 's/[^;]*/[\0]/g'

and I get:

[one];[two];[];[three]

What is the reason for this different result?

EDIT:

Somebody replied "because sed is not perl". I know that. The reason I'm asking my question is because I don't understand how sed copes so well with zero-length matches.

like image 513
Niccolo M. Avatar asked Dec 23 '13 13:12

Niccolo M.


1 Answers

This is an interesting and surprising edge case.

Your [^;]* pattern may match the empty string, so it becomes a philosophy question, viz., how many empty strings are between two characters: zero, one, or many?

sed

The sed match clearly follows the philosophy described in the “Advancing After a Zero–Length Regex Match” section of “Zero–Length Regex Matches.”

Now the regex engine is in a tricky situation. We’re asking it to go through the entire string to find all non–overlapping regex matches. The first match ended at the start of the string, where the first match attempt began. The regex engine needs a way to avoid getting stuck in an infinite loop that forever finds the same zero-length match at the start of the string.

The simplest solution, which is used by most regex engines, is to start the next match attempt one character after the end of the previous match, if the previous match was zero–length.

That is, zero empty strings are between characters.

The above passage is not an authoritative standard, and quoting such a document instead would make this a better answer.

Inspecting the source of GNU sed, we see

/* Start after the match.  last_end is the real end of the matched
   substring, excluding characters that were skipped in case the RE
   matched the empty string.  */
start = offset + matched;
last_end = regs.end[0];

Perl and Ruby

Perl’s philosophy with s///, which Ruby seems to share—so the documentation and examples below use Perl to represent both—is there is exactly one empty string after each character.

The “Regexp Quote–Like Operators” section of the perlop documentation reads

The /g modifier specifies global pattern matching—that is, matching as many times as possible within the string.

Tracing execution of s/([^;]*)/[\1]/g gives

  1. Start. The “match position,” denoted by ^, is at the beginning of the target string.

     o n e ; t w o ; ; t h r e e
    ^
    
  2. Attempt to match [^;]*.

     o n e ; t w o ; ; t h r e e
          ^
    

    Note that the result captured in $1 is one.

  3. Attempt to match [^;]*.

     o n e ; t w o ; ; t h r e e
          ^
    

    Important Lesson: The * regex quantifier always succeeds because it means “zero or more.” In this case, the substring in $1 is the empty string.

The rest of the match proceeds as in the above.

Being a perceptive reader, you now ask yourself, “Self, if * always succeeds, how does the match terminate at the end of the target string, or for that matter, how does it get past even the first zero–length match?”

We find the answer to this incisive question in the “Repeated Patterns Matching a Zero–length Substring” section of the perlre documentation.

However, long experience has shown that many programming tasks may be significantly simplified by using repeated subexpressions that may match zero–length substrings. Here’s a simple example being:

@chars = split //, $string; # // is not magic in split
($whitewashed = $string) =~ s/()/ /g; # parens avoid magic s// /

Thus Perl allows such constructs, by forcefully breaking the infinite loop. The rules for this are different for lower–level loops given by the greedy quantifiers *+{}, and for higher-level ones like the /g modifier or split operator.

The higher–level loops preserve an additional state between iterations: whether the last match was zero–length. To break the loop, the following match after a zero–length match is prohibited to have a length of zero. This prohibition interacts with backtracking … and so the second best match is chosen if the best match is of zero length.

Other Perl approaches

With the addition of a negative lookbehind assertion, you can filter the spurious empty matches.

  $ perl -le '$a = "one;two;;three";
              $a =~ s/(?<![^;])([^;]*)/[\1]/g;
              print $a;'
  [one];[two];[];[three]

Apply what Mark Dominus dubbed Randal’s Rule, “Use capturing when you know what you want to keep. Use split when you know what you want to throw away.” You want to throw away the semicolons, so your code becomes more direct with

$ perl -le '$a = "one;two;;three";
            $a = join ";", map "[$_]", split /;/, $a;
            print $a;'
[one];[two];[];[three]
like image 110
Greg Bacon Avatar answered Sep 28 '22 18:09

Greg Bacon