Why does the following regex capture (via the capturing group) the string 'abc' in Javascript, but not in PCRE (although it will still match)?
(.*)*
In general, Perl uses a backtrack regex engine. Such an engine is flexible, easy to implement and very fast on a subset of regex. However, for other types of regex, for example when there is the | operator, it may become very slow.
Perl Compatible Regular Expressions (PCRE) is a library written in C, which implements a regular expression engine, inspired by the capabilities of the Perl programming language. Philip Hazel started writing PCRE in summer 1997.
Introduction to Regular ExpressionsA regular expression (also called regex for short) is a fast way to work with strings of text. By formulating a regular expression with a special syntax, you can: search for text in a string.
They are different; One difference is Python supports Unicode and Javascript doesn't. Read Mastering Regular Expressions. It gives information on how to identify the back-end engines (DFA vs NFA vs Hybrid) that a regex flavour uses. It gives tons of information on the different regex flavours out there.
Here's why the capture group is empty in PCRE:
Initial state
(.*)*     abc
 ^        ^
First, the (.*) part is matched against abc, and the input position is advanced to the end. The capture group contains abc at this point.
(.*)*     abc
    ^        ^
Now, the input position is after the c character, the remaining input is the empty string. The Kleene star initiates a second attempt at matching the (.*) group:
(.*)*     abc
 ^           ^
The (.*) group matches the empty string after abc. Since it matched, the previously captured string is overwritten.
Since the input position hasn't advanced, the * ends iterating there and the match succeeds.
The behavior difference between JS and PCRE is due to the way the regex engines are specified. PCRE's behavior is consistent with Perl:
PCRE:
$ pcretest
PCRE version 8.39 2016-06-14
  re> /(.*)*/
data> abc
 0: abc
 1:
Perl:
$ perl -e '"abc" =~ /(.*)*/; print "<$&> <$1>\n";'
<abc> <>
Let's compare this with .NET, which has the same behavior, but supports multiple captures:

When a capture group is matched for a second time, .NET will add the captured value to a capture stack. Perl and PCRE will simply overwrite it.
As for JavaScript:
Here's ECMA-262 §21.2.2.5.1 Runtime Semantics: RepeatMatcher Abstract Operation:
The abstract operation RepeatMatcher takes eight parameters, a Matcher
m, an integermin, an integer (or ∞)max, a Booleangreedy, a Statex, a Continuationc, an integerparenIndex, and an integerparenCount, and performs the following steps:
- If
 maxis zero, returnc(x).- Create an internal Continuation closure
 dthat takes one State argumentyand performs the following steps when evaluated:
- a. If
 minis zero andy'sendIndexis equal tox'sendIndex, returnfailure.- b. If
 minis zero, letmin2be zero; otherwise letmin2bemin‑1.- c. If
 maxis ∞, letmax2be ∞; otherwise letmax2bemax‑1.- d. Call
 RepeatMatcher(m, min2, max2, greedy, y, c, parenIndex, parenCount)and return its result.- Let
 capbe a fresh copy ofx's captures List.- For every integer
 kthat satisfiesparenIndex < kandk ≤ parenIndex+parenCount, setcap[k]toundefined.- Let
 ebex's endIndex.- Let
 xrbe the State(e, cap).- If
 minis not zero, returnm(xr, d).- If
 greedyisfalse, then
- a. Call
 c(x)and letzbe its result.- b. If
 zis notfailure, returnz.- c. Call
 m(xr, d)and return its result.- Call
 m(xr, d)and letzbe its result.- If
 zis notfailure, returnz.- Call
 c(x)and return its result.
This is basically the definition of what's supposed to be happening when a quantifier is evaluated. RepeatMatcher is the operation handling the matching of an inner operation m.
You'll also need to understand what a State is (§21.2.2.1, emphasis mine):
A State is an ordered pair (
endIndex,captures) whereendIndexis an integer and captures is a List ofNcapturingParensvalues. States are used to represent partial match states in the regular expression matching algorithms. TheendIndexis one plus the index of the last input character matched so far by the pattern, whilecapturesholds the results of capturing parentheses. Thenth element ofcapturesis either a List that represents the value obtained by thenth set of capturing parentheses or undefined if thenth set of capturing parentheses hasn't been reached yet. Due to backtracking, many States may be in use at any time during the matching process.
For your example, the RepeatMatcher parameters are:
m: the Matcher responsible for handling the subpattern (.*)
min: 0 (minimum number of matches for the Kleene star quantifier)max: ∞ (maximum number of matches for the Kleene star quantifier)greedy: true (the Kleene star quantifier used is greedy)x: (0, [undefined]) (see the state definition above)c: The continuation, at this point it'll be: a Continuation that always returns its State argument as a successful MatchResult, after collapsing the parent rulesparenIndex: 0 (as per §21.2.2.5 this is the number of left capturing parentheses in the entire regular expression that occur to the left of this
production)parenCount: 1 (same spec paragraph, this is the number of left capturing parentheses in the expansion of this production's Atom - I won't paste the full spec here but this basically means that m defines one capture group)The regex matching algorithm in the spec is defined in terms of continuation-passing style. Basically, this means that the c operation means what should happen next.
Let's unroll this algorithm.
On the first pass, the x1 state is (0, [undefined]).
max is not zerod1 at this point, it'll be used in the second pass so we'll come back to this one later.cap1 of the capture listcap1 to undefined, this is a no-op in the fist passe1 = 0xr1 = (e1, cap1)min is zero, skip this stepgreedy is true, skip this stepz1 = m(xr, d1) - this evaluates the subpattern (.*)
Now let's step back a bit - m will match (.*) against abc, and then call our d1 closure, so let's unroll that one.
d1 is evaluated with the state y1 =(3, ["abc"]):
min is 0, but y1's endIndex is 3 and x1's endIndex is 0, so don't return failure
min2 = min = 0 since min = 0max2 = max = ∞ since max = ∞RepeatMatcher(m, min2, max2, greedy, y, c, parenIndex, parenCount) and return its result. That is: RepeatMatcher(m, 0, ∞, false, y1, c, 0, 1)So, right now we're going for a second iteration with x2 = y1 = (3, ["abc"]).
max is not zerod2 at this pointcap2 of the capture list ["abc"]
cap2 to undefined, we get cap2 = [undefined]
e2 = 3xr2 = (e2, cap2)min is zero, skip this stepgreedy is true, skip this stepLet z2 = m(xr2, d2) - this evaluates the subpattern (.*)
This time m will match the empty string after abc, and call our d2 closure with that one. Let's evaluate what d2 does. It's parameter is y2 = (3, [""])
min is still 0, but y2's endIndex is 3 and x2's endIndex is also 3 (remember that this time x is the y state from the previous iteration), the closure simply returns failure.
z2 is failure, skip this stepc(x2), which is c((3, ["abc"])) in this iteration.c simply returns a valid MatchResult here as we're at the end of the pattern. Which means that d1 returns this result, and the first iteration returns passes it along from step 10.
Basically, as you can see, the spec line which causes the JS behavior to be different than PCRE's is the following one:
a. If
minis zero andy'sendIndexis equal tox'sendIndex, returnfailure.
When combined with:
- Call
 c(x)and return its result.
Which returns the previously captured values if the iteration fails.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With