Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Perl vs Javascript regular expressions

Why does the following regex capture (via the capturing group) the string 'abc' in Javascript, but not in PCRE (although it will still match)?

(.*)*

like image 665
Tim Angus Avatar asked Nov 18 '16 14:11

Tim Angus


People also ask

Is Perl good for regex?

In general, Perl uses a backtrack regex engine. Such an engine is flexible, easy to implement and very fast on a subset of regex. However, for other types of regex, for example when there is the | operator, it may become very slow.

What regex engine does Perl use?

Perl Compatible Regular Expressions (PCRE) is a library written in C, which implements a regular expression engine, inspired by the capabilities of the Perl programming language. Philip Hazel started writing PCRE in summer 1997.

Is JavaScript regex fast?

Introduction to Regular ExpressionsA regular expression (also called regex for short) is a fast way to work with strings of text. By formulating a regular expression with a special syntax, you can: search for text in a string.

Is Python regex same as JavaScript?

They are different; One difference is Python supports Unicode and Javascript doesn't. Read Mastering Regular Expressions. It gives information on how to identify the back-end engines (DFA vs NFA vs Hybrid) that a regex flavour uses. It gives tons of information on the different regex flavours out there.


1 Answers

Here's why the capture group is empty in PCRE:

  • Initial state

    (.*)*     abc
     ^        ^
    
  • First, the (.*) part is matched against abc, and the input position is advanced to the end. The capture group contains abc at this point.

    (.*)*     abc
        ^        ^
    
  • Now, the input position is after the c character, the remaining input is the empty string. The Kleene star initiates a second attempt at matching the (.*) group:

    (.*)*     abc
     ^           ^
    
  • The (.*) group matches the empty string after abc. Since it matched, the previously captured string is overwritten.

  • Since the input position hasn't advanced, the * ends iterating there and the match succeeds.

The behavior difference between JS and PCRE is due to the way the regex engines are specified. PCRE's behavior is consistent with Perl:

PCRE:

$ pcretest
PCRE version 8.39 2016-06-14

  re> /(.*)*/
data> abc
 0: abc
 1:

Perl:

$ perl -e '"abc" =~ /(.*)*/; print "<$&> <$1>\n";'
<abc> <>

Let's compare this with .NET, which has the same behavior, but supports multiple captures:

.NET regex

When a capture group is matched for a second time, .NET will add the captured value to a capture stack. Perl and PCRE will simply overwrite it.


As for JavaScript:

Here's ECMA-262 §21.2.2.5.1 Runtime Semantics: RepeatMatcher Abstract Operation:

The abstract operation RepeatMatcher takes eight parameters, a Matcher m, an integer min, an integer (or ∞) max, a Boolean greedy, a State x, a Continuation c, an integer parenIndex, and an integer parenCount, and performs the following steps:

  1. If max is zero, return c(x).
  2. Create an internal Continuation closure d that takes one State argument y and performs the following steps when evaluated:
    • a. If min is zero and y's endIndex is equal to x's endIndex, return failure.
    • b. If min is zero, let min2 be zero; otherwise let min2 be min‑1.
    • c. If max is ∞, let max2 be ∞; otherwise let max2 be max‑1.
    • d. Call RepeatMatcher(m, min2, max2, greedy, y, c, parenIndex, parenCount) and return its result.
  3. Let cap be a fresh copy of x's captures List.
  4. For every integer k that satisfies parenIndex < k and k ≤ parenIndex+parenCount, set cap[k] to undefined.
  5. Let e be x's endIndex.
  6. Let xr be the State (e, cap).
  7. If min is not zero, return m(xr, d).
  8. If greedy is false, then
    • a. Call c(x) and let z be its result.
    • b. If z is not failure, return z.
    • c. Call m(xr, d) and return its result.
  9. Call m(xr, d) and let z be its result.
  10. If z is not failure, return z.
  11. Call c(x) and return its result.

This is basically the definition of what's supposed to be happening when a quantifier is evaluated. RepeatMatcher is the operation handling the matching of an inner operation m.

You'll also need to understand what a State is (§21.2.2.1, emphasis mine):

A State is an ordered pair (endIndex, captures) where endIndex is an integer and captures is a List of NcapturingParens values. States are used to represent partial match states in the regular expression matching algorithms. The endIndex is one plus the index of the last input character matched so far by the pattern, while captures holds the results of capturing parentheses. The nth element of captures is either a List that represents the value obtained by the nth set of capturing parentheses or undefined if the nth set of capturing parentheses hasn't been reached yet. Due to backtracking, many States may be in use at any time during the matching process.

For your example, the RepeatMatcher parameters are:

  • m: the Matcher responsible for handling the subpattern (.*)
  • min: 0 (minimum number of matches for the Kleene star quantifier)
  • max: ∞ (maximum number of matches for the Kleene star quantifier)
  • greedy: true (the Kleene star quantifier used is greedy)
  • x: (0, [undefined]) (see the state definition above)
  • c: The continuation, at this point it'll be: a Continuation that always returns its State argument as a successful MatchResult, after collapsing the parent rules
  • parenIndex: 0 (as per §21.2.2.5 this is the number of left capturing parentheses in the entire regular expression that occur to the left of this production)
  • parenCount: 1 (same spec paragraph, this is the number of left capturing parentheses in the expansion of this production's Atom - I won't paste the full spec here but this basically means that m defines one capture group)

The regex matching algorithm in the spec is defined in terms of continuation-passing style. Basically, this means that the c operation means what should happen next.

Let's unroll this algorithm.

First iteration

On the first pass, the x1 state is (0, [undefined]).

  1. max is not zero
  2. We create the continuation closure d1 at this point, it'll be used in the second pass so we'll come back to this one later.
  3. Make a copy cap1 of the capture list
  4. Reset the capture in cap1 to undefined, this is a no-op in the fist pass
  5. Let e1 = 0
  6. Let xr1 = (e1, cap1)
  7. min is zero, skip this step
  8. greedy is true, skip this step
  9. Let z1 = m(xr, d1) - this evaluates the subpattern (.*)

Now let's step back a bit - m will match (.*) against abc, and then call our d1 closure, so let's unroll that one.

d1 is evaluated with the state y1 =(3, ["abc"]):

  • min is 0, but y1's endIndex is 3 and x1's endIndex is 0, so don't return failure
  • Let min2 = min = 0 since min = 0
  • Let max2 = max = ∞ since max = ∞
  • Call RepeatMatcher(m, min2, max2, greedy, y, c, parenIndex, parenCount) and return its result. That is: RepeatMatcher(m, 0, ∞, false, y1, c, 0, 1)

Second iteration

So, right now we're going for a second iteration with x2 = y1 = (3, ["abc"]).

  1. max is not zero
  2. We create the continuation closure d2 at this point
  3. Make a copy cap2 of the capture list ["abc"]
  4. Reset the capture in cap2 to undefined, we get cap2 = [undefined]
  5. Let e2 = 3
  6. Let xr2 = (e2, cap2)
  7. min is zero, skip this step
  8. greedy is true, skip this step
  9. Let z2 = m(xr2, d2) - this evaluates the subpattern (.*)

    This time m will match the empty string after abc, and call our d2 closure with that one. Let's evaluate what d2 does. It's parameter is y2 = (3, [""])

    min is still 0, but y2's endIndex is 3 and x2's endIndex is also 3 (remember that this time x is the y state from the previous iteration), the closure simply returns failure.

  10. z2 is failure, skip this step
  11. return c(x2), which is c((3, ["abc"])) in this iteration.

c simply returns a valid MatchResult here as we're at the end of the pattern. Which means that d1 returns this result, and the first iteration returns passes it along from step 10.

Basically, as you can see, the spec line which causes the JS behavior to be different than PCRE's is the following one:

a. If min is zero and y's endIndex is equal to x's endIndex, return failure.

When combined with:

  1. Call c(x) and return its result.

Which returns the previously captured values if the iteration fails.

like image 122
Lucas Trzesniewski Avatar answered Oct 04 '22 06:10

Lucas Trzesniewski