Why does the following regex capture the string 'abc' in its capturing group in JavaScript, but not in PCRE (where it still matches)? <code>(.*)*</code>

Perl vs Javascript regular expressions

1 Answer

Here's why the capture group is empty in PCRE:

Initial state
```
(.*)*     abc
 ^        ^
```
First, the (.*) part is matched against abc, and the input position is advanced to the end. The capture group contains abc at this point.
```
(.*)*     abc
    ^        ^
```
Now the input position is after the c character, and the remaining input is the empty string. The Kleene star initiates a second attempt at matching the (.*) group:
```
(.*)*     abc
 ^           ^
```
The (.*) group matches the empty string after abc. Since it matched, the previously captured string is overwritten.
Since the input position hasn't advanced, the * ends iterating there and the match succeeds.
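The steps above can be sketched as a small loop. This is only an illustration of the described behavior, not PCRE's actual implementation; the function name pcreStyleStarGroup and the returned field names are made up for this example.

```javascript
// Imitates the described handling of (.*)* : every iteration of the outer *
// overwrites group 1, and the loop stops once an iteration consumes nothing.
function pcreStyleStarGroup(input) {
  let pos = 0;
  let group1;
  while (true) {
    const start = pos;
    // (.*) is greedy: take everything up to the end of input or a newline.
    let end = start;
    while (end < input.length && input[end] !== '\n') end++;
    group1 = input.slice(start, end); // the previous capture is overwritten
    pos = end;
    if (pos === start) break; // the input position did not advance
  }
  return { match: input.slice(0, pos), group1 };
}

console.log(pcreStyleStarGroup('abc')); // { match: 'abc', group1: '' }
```

The second pass through the loop is the one that replaces "abc" with the empty string before the loop exits.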

The behavior difference between JS and PCRE is due to the way the regex engines are specified. PCRE's behavior is consistent with Perl:

PCRE:

```
$ pcretest
PCRE version 8.39 2016-06-14

  re> /(.*)*/
data> abc
 0: abc
 1:
```

Perl:

```
$ perl -e '"abc" =~ /(.*)*/; print "<$&> <$1>\n";'
<abc> <>
```

Let's compare this with .NET, which has the same behavior, but supports multiple captures:

[Image: .NET regex]

When a capture group is matched for a second time, .NET will add the captured value to a capture stack. Perl and PCRE will simply overwrite it.

As for JavaScript:
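As a quick check of the observable behavior, the pattern can be run in any JavaScript engine. The snippet uses only the standard RegExp.prototype.exec; the variable name is arbitrary.

```javascript
// Group 1 keeps "abc" in JavaScript, unlike the PCRE and Perl results.
const match = /(.*)*/.exec('abc');

console.log(match[0]); // abc  (whole match)
console.log(match[1]); // abc  (capture group 1)
```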

Here's ECMA-262 §21.2.2.5.1 Runtime Semantics: RepeatMatcher Abstract Operation:

The abstract operation RepeatMatcher takes eight parameters, a Matcher m, an integer min, an integer (or ∞) max, a Boolean greedy, a State x, a Continuation c, an integer parenIndex, and an integer parenCount, and performs the following steps:

1. If max is zero, return c(x).
2. Create an internal Continuation closure d that takes one State argument y and performs the following steps when evaluated:
   a. If min is zero and y's endIndex is equal to x's endIndex, return failure.
   b. If min is zero, let min2 be zero; otherwise let min2 be min‑1.
   c. If max is ∞, let max2 be ∞; otherwise let max2 be max‑1.
   d. Call RepeatMatcher(m, min2, max2, greedy, y, c, parenIndex, parenCount) and return its result.
3. Let cap be a fresh copy of x's captures List.
4. For every integer k that satisfies parenIndex < k and k ≤ parenIndex+parenCount, set cap[k] to undefined.
5. Let e be x's endIndex.
6. Let xr be the State (e, cap).
7. If min is not zero, return m(xr, d).
8. If greedy is false, then
   a. Call c(x) and let z be its result.
   b. If z is not failure, return z.
   c. Call m(xr, d) and return its result.
9. Call m(xr, d) and let z be its result.
10. If z is not failure, return z.
11. Call c(x) and return its result.

This defines what happens when a quantifier is evaluated: RepeatMatcher handles the repeated matching of an inner Matcher m.
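For readers who prefer code, here is a rough JavaScript transcription of RepeatMatcher, applied to (.*)* and the input abc. It is a sketch under a few assumptions that are not in the spec text: a State is modelled as a plain object { endIndex, captures }, failure is null, captures are stored 1-based with slot 0 unused, and the helper matchers dot and group are simplified stand-ins written for this example.

```javascript
const FAILURE = null;
const input = 'abc';

// Numbers in the comments refer to the RepeatMatcher steps quoted above.
function repeatMatcher(m, min, max, greedy, x, c, parenIndex, parenCount) {
  if (max === 0) return c(x);                                   // 1
  const d = (y) => {                                            // 2
    if (min === 0 && y.endIndex === x.endIndex) return FAILURE; // 2a
    const min2 = min === 0 ? 0 : min - 1;                       // 2b
    const max2 = max === Infinity ? Infinity : max - 1;         // 2c
    return repeatMatcher(m, min2, max2, greedy, y, c, parenIndex, parenCount); // 2d
  };
  const cap = x.captures.slice();                               // 3
  for (let k = parenIndex + 1; k <= parenIndex + parenCount; k++) {
    cap[k] = undefined;                                         // 4
  }
  const xr = { endIndex: x.endIndex, captures: cap };           // 5, 6
  if (min !== 0) return m(xr, d);                               // 7
  if (!greedy) {                                                // 8
    const z = c(x);
    if (z !== FAILURE) return z;
    return m(xr, d);
  }
  const z = m(xr, d);                                           // 9
  if (z !== FAILURE) return z;                                  // 10
  return c(x);                                                  // 11
}

// Simplified matcher for `.`: one character that is not a line terminator.
const dot = (x, c) => {
  const ch = input[x.endIndex];
  if (ch === undefined || '\n\r\u2028\u2029'.includes(ch)) return FAILURE;
  return c({ endIndex: x.endIndex + 1, captures: x.captures });
};

// Simplified matcher for capture group number n wrapping the matcher inner.
const group = (n, inner) => (x, c) =>
  inner(x, (y) => {
    const cap = y.captures.slice();
    cap[n] = input.slice(x.endIndex, y.endIndex);
    return c({ endIndex: y.endIndex, captures: cap });
  });

// `.*` sits inside group 1, so parenIndex = 1 and parenCount = 0 for it.
const dotStar = (x, c) => repeatMatcher(dot, 0, Infinity, true, x, c, 1, 0);
// `(.*)*`: parenIndex = 0, parenCount = 1.
const pattern = (x, c) =>
  repeatMatcher(group(1, dotStar), 0, Infinity, true, x, c, 0, 1);

const result = pattern({ endIndex: 0, captures: [undefined, undefined] }, (x) => x);
console.log(result.endIndex, result.captures[1]); // 3 abc
```

Running it ends with endIndex 3 and group 1 set to abc, which agrees with what a JavaScript engine reports for this pattern.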

You'll also need to understand what a State is (§21.2.2.1, emphasis mine):

A State is an ordered pair (endIndex, captures) where endIndex is an integer and captures is a List of NcapturingParens values. States are used to represent partial match states in the regular expression matching algorithms. The endIndex is one plus the index of the last input character matched so far by the pattern, while captures holds the results of capturing parentheses. The nth element of captures is either a List that represents the value obtained by the nth set of capturing parentheses or undefined if the nth set of capturing parentheses hasn't been reached yet. Due to backtracking, many States may be in use at any time during the matching process.

For your example, the RepeatMatcher parameters are:

m: the Matcher responsible for handling the subpattern (.*)
min: 0 (minimum number of matches for the Kleene star quantifier)
max: ∞ (maximum number of matches for the Kleene star quantifier)
greedy: true (the Kleene star quantifier used is greedy)
x: (0, [undefined]) (see the state definition above)
c: the Continuation. At this point it is a Continuation that always returns its State argument as a successful MatchResult, after collapsing the parent rules
parenIndex: 0 (as per §21.2.2.5 this is the number of left capturing parentheses in the entire regular expression that occur to the left of this production)
parenCount: 1 (same spec paragraph, this is the number of left capturing parentheses in the expansion of this production's Atom - I won't paste the full spec here but this basically means that m defines one capture group)

The regex matching algorithm in the spec is defined in continuation-passing style. Basically, this means that the Continuation c represents what should happen next, once the current Matcher has done its part.
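To make the continuation-passing idea concrete, here is a tiny sketch unrelated to the spec's exact data types. The names char, seq and accept are invented for this example, a State is a plain object, and null stands for failure.

```javascript
// A Matcher takes a State and a Continuation and returns a State or null.
const input = 'ab';

// Matcher for a single literal character.
const char = (ch) => (x, c) =>
  input[x.endIndex] === ch
    ? c({ endIndex: x.endIndex + 1, captures: x.captures })
    : null;

// Sequencing: run `first`, and let its continuation run `second`.
const seq = (first, second) => (x, c) => first(x, (y) => second(y, c));

// The final continuation accepts whatever State it is given.
const accept = (x) => x;

const start = { endIndex: 0, captures: [] };
console.log(seq(char('a'), char('b'))(start, accept)); // a State with endIndex 2
console.log(seq(char('b'), char('a'))(start, accept)); // null (failure)
```

Each matcher never returns "its own" result; it either fails or hands the new State to whatever comes next, which is why a later failure can undo an earlier choice.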

Let's unroll this algorithm.

First iteration

On the first pass, the x₁ state is (0, [undefined]).

1. max is not zero
2. We create the continuation closure d₁ at this point; it'll be used in the second pass, so we'll come back to it later.
3. Make a copy cap₁ of the capture list
4. Reset the capture in cap₁ to undefined; this is a no-op in the first pass
5. Let e₁ = 0
6. Let xr₁ = (e₁, cap₁)
7. min is zero, skip this step
8. greedy is true, skip this step
9. Let z₁ = m(xr₁, d₁) - this evaluates the subpattern (.*)

Now let's step back a bit - m will match (.*) against abc, and then call our d₁ closure, so let's unroll that one.

d₁ is evaluated with the state y₁ = (3, ["abc"]):

a. min is 0, but y₁'s endIndex is 3 and x₁'s endIndex is 0, so don't return failure
b. Let min2 = min = 0 since min = 0
c. Let max2 = max = ∞ since max = ∞
d. Call RepeatMatcher(m, min2, max2, greedy, y, c, parenIndex, parenCount) and return its result. That is: RepeatMatcher(m, 0, ∞, true, y₁, c, 0, 1)

Second iteration

So, right now we're going for a second iteration with x₂ = y₁ = (3, ["abc"]).

1. max is not zero
2. We create the continuation closure d₂ at this point
3. Make a copy cap₂ of the capture list ["abc"]
4. Reset the capture in cap₂ to undefined; we get cap₂ = [undefined]
5. Let e₂ = 3
6. Let xr₂ = (e₂, cap₂)
7. min is zero, skip this step
8. greedy is true, skip this step
9. Let z₂ = m(xr₂, d₂) - this evaluates the subpattern (.*)

   This time m will match the empty string after abc, and call our d₂ closure with the resulting state. Let's evaluate what d₂ does. Its parameter is y₂ = (3, [""]).

   min is still 0, but y₂'s endIndex is 3 and x₂'s endIndex is also 3 (remember that this time x is the y state from the previous iteration), so the closure simply returns failure.
10. z₂ is failure, skip this step
11. Return c(x₂), which is c((3, ["abc"])) in this iteration.

c simply returns a valid MatchResult here as we're at the end of the pattern. This means that d₁ returns this result, and the first iteration passes it along from step 10.

Basically, as you can see, the spec line which causes the JS behavior to differ from PCRE's is the following one:

a. If min is zero and y's endIndex is equal to x's endIndex, return failure.

When combined with:

11. Call c(x) and return its result.

Because c is called with x rather than with the reset state xr, this returns the previously captured values if the iteration fails.
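The effect of that one check can be seen by making it switchable in a simplified greedy star. This is a sketch, not a transcription of any engine: star, matchGroupStar and the failOnEmptyIteration flag are invented for this example, and the "accept the empty iteration and stop" branch only approximates the Perl/PCRE behavior described earlier.

```javascript
const FAILURE = null;

// Greedy `*` (min = 0, max = ∞) in continuation-passing style.
// failOnEmptyIteration = true follows step 2a of RepeatMatcher;
// false accepts an empty iteration and stops looping (assumed PCRE-like).
function star(m, groupIndex, failOnEmptyIteration) {
  const loop = (x, c) => {
    const d = (y) => {
      if (y.endIndex === x.endIndex) {
        return failOnEmptyIteration ? FAILURE : c(y);
      }
      return loop(y, c);
    };
    const cap = x.captures.slice();
    if (groupIndex !== undefined) cap[groupIndex] = undefined; // step 4
    const z = m({ endIndex: x.endIndex, captures: cap }, d);   // step 9
    return z !== FAILURE ? z : c(x);                           // steps 10-11
  };
  return loop;
}

function matchGroupStar(input, failOnEmptyIteration) {
  // Simplified `.`: one character that is not a line terminator.
  const dot = (x, c) =>
    x.endIndex < input.length && !'\n\r\u2028\u2029'.includes(input[x.endIndex])
      ? c({ endIndex: x.endIndex + 1, captures: x.captures })
      : FAILURE;
  // The inner `.*` always advances, so the flag only matters for the outer `*`.
  const dotStar = star(dot, undefined, true);
  const group1 = (x, c) =>
    dotStar(x, (y) => {
      const cap = y.captures.slice();
      cap[1] = input.slice(x.endIndex, y.endIndex);
      return c({ endIndex: y.endIndex, captures: cap });
    });
  const pattern = star(group1, 1, failOnEmptyIteration);
  return pattern({ endIndex: 0, captures: [undefined, undefined] }, (x) => x);
}

console.log(JSON.stringify(matchGroupStar('abc', true).captures[1]));  // "abc"
console.log(JSON.stringify(matchGroupStar('abc', false).captures[1])); // ""
```

With the check enabled, the empty second iteration is discarded and step 11 falls back to the state holding abc; with it disabled, the empty iteration's capture is the one that survives.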

answered Oct 04 '22 06:10

Lucas Trzesniewski

Tags: javascript, regex, perl, pcre

Tim Angus
