Why does the following regex capture (via the capturing group) the string 'abc' in Javascript, but not in PCRE (although it will still match)?
(.*)*
In general, Perl uses a backtrack regex engine. Such an engine is flexible, easy to implement and very fast on a subset of regex. However, for other types of regex, for example when there is the | operator, it may become very slow.
Perl Compatible Regular Expressions (PCRE) is a library written in C, which implements a regular expression engine, inspired by the capabilities of the Perl programming language. Philip Hazel started writing PCRE in summer 1997.
Introduction to Regular ExpressionsA regular expression (also called regex for short) is a fast way to work with strings of text. By formulating a regular expression with a special syntax, you can: search for text in a string.
They are different; One difference is Python supports Unicode and Javascript doesn't. Read Mastering Regular Expressions. It gives information on how to identify the back-end engines (DFA vs NFA vs Hybrid) that a regex flavour uses. It gives tons of information on the different regex flavours out there.
Here's why the capture group is empty in PCRE:
Initial state
(.*)* abc
^ ^
First, the (.*)
part is matched against abc
, and the input position is advanced to the end. The capture group contains abc
at this point.
(.*)* abc
^ ^
Now, the input position is after the c
character, the remaining input is the empty string. The Kleene star initiates a second attempt at matching the (.*)
group:
(.*)* abc
^ ^
The (.*)
group matches the empty string after abc
. Since it matched, the previously captured string is overwritten.
Since the input position hasn't advanced, the *
ends iterating there and the match succeeds.
The behavior difference between JS and PCRE is due to the way the regex engines are specified. PCRE's behavior is consistent with Perl:
PCRE:
$ pcretest
PCRE version 8.39 2016-06-14
re> /(.*)*/
data> abc
0: abc
1:
Perl:
$ perl -e '"abc" =~ /(.*)*/; print "<$&> <$1>\n";'
<abc> <>
Let's compare this with .NET, which has the same behavior, but supports multiple captures:
When a capture group is matched for a second time, .NET will add the captured value to a capture stack. Perl and PCRE will simply overwrite it.
As for JavaScript:
Here's ECMA-262 §21.2.2.5.1 Runtime Semantics: RepeatMatcher Abstract Operation:
The abstract operation RepeatMatcher takes eight parameters, a Matcher
m
, an integermin
, an integer (or ∞)max
, a Booleangreedy
, a Statex
, a Continuationc
, an integerparenIndex
, and an integerparenCount
, and performs the following steps:
- If
max
is zero, returnc(x)
.- Create an internal Continuation closure
d
that takes one State argumenty
and performs the following steps when evaluated:
- a. If
min
is zero andy
'sendIndex
is equal tox
'sendIndex
, returnfailure
.- b. If
min
is zero, letmin2
be zero; otherwise letmin2
bemin‑1
.- c. If
max
is ∞, letmax2
be ∞; otherwise letmax2
bemax‑1
.- d. Call
RepeatMatcher(m, min2, max2, greedy, y, c, parenIndex, parenCount)
and return its result.- Let
cap
be a fresh copy ofx
's captures List.- For every integer
k
that satisfiesparenIndex < k
andk ≤ parenIndex+parenCount
, setcap[k]
toundefined
.- Let
e
bex
's endIndex.- Let
xr
be the State(e, cap)
.- If
min
is not zero, returnm(xr, d)
.- If
greedy
isfalse
, then
- a. Call
c(x)
and letz
be its result.- b. If
z
is notfailure
, returnz
.- c. Call
m(xr, d)
and return its result.- Call
m(xr, d)
and letz
be its result.- If
z
is notfailure
, returnz
.- Call
c(x)
and return its result.
This is basically the definition of what's supposed to be happening when a quantifier is evaluated. RepeatMatcher
is the operation handling the matching of an inner operation m
.
You'll also need to understand what a State is (§21.2.2.1, emphasis mine):
A State is an ordered pair (
endIndex
,captures
) whereendIndex
is an integer and captures is a List ofNcapturingParens
values. States are used to represent partial match states in the regular expression matching algorithms. TheendIndex
is one plus the index of the last input character matched so far by the pattern, whilecaptures
holds the results of capturing parentheses. Then
th element ofcaptures
is either a List that represents the value obtained by then
th set of capturing parentheses or undefined if then
th set of capturing parentheses hasn't been reached yet. Due to backtracking, many States may be in use at any time during the matching process.
For your example, the RepeatMatcher
parameters are:
m
: the Matcher responsible for handling the subpattern (.*)
min
: 0 (minimum number of matches for the Kleene star quantifier)max
: ∞ (maximum number of matches for the Kleene star quantifier)greedy
: true (the Kleene star quantifier used is greedy)x
: (0, [undefined])
(see the state definition above)c
: The continuation, at this point it'll be: a Continuation that always returns its State argument as a successful MatchResult
, after collapsing the parent rulesparenIndex
: 0 (as per §21.2.2.5 this is the number of left capturing parentheses in the entire regular expression that occur to the left of this
production)parenCount
: 1 (same spec paragraph, this is the number of left capturing parentheses in the expansion of this production's Atom - I won't paste the full spec here but this basically means that m
defines one capture group)The regex matching algorithm in the spec is defined in terms of continuation-passing style. Basically, this means that the c
operation means what should happen next.
Let's unroll this algorithm.
On the first pass, the x
1 state is (0, [undefined])
.
max
is not zerod
1 at this point, it'll be used in the second pass so we'll come back to this one later.cap
1 of the capture listcap
1 to undefined
, this is a no-op in the fist passe
1 = 0xr
1 = (e
1, cap
1)min
is zero, skip this stepgreedy
is true, skip this stepz
1 = m
(xr
, d
1) - this evaluates the subpattern (.*)
Now let's step back a bit - m
will match (.*)
against abc
, and then call our d
1 closure, so let's unroll that one.
d
1 is evaluated with the state y
1 =(3, ["abc"])
:
min
is 0, but y
1's endIndex
is 3 and x
1's endIndex
is 0, so don't return failure
min2
= min
= 0 since min
= 0max2
= max
= ∞ since max
= ∞RepeatMatcher(m, min2, max2, greedy, y, c, parenIndex, parenCount)
and return its result. That is: RepeatMatcher(m, 0, ∞, false, y1, c, 0, 1)So, right now we're going for a second iteration with x
2 = y
1 = (3, ["abc"])
.
max
is not zerod
2 at this pointcap
2 of the capture list ["abc"]
cap
2 to undefined
, we get cap
2 = [undefined]
e
2 = 3xr
2 = (e
2, cap
2)min
is zero, skip this stepgreedy
is true, skip this stepLet z
2 = m
(xr
2, d
2) - this evaluates the subpattern (.*)
This time m
will match the empty string after abc
, and call our d
2 closure with that one. Let's evaluate what d
2 does. It's parameter is y
2 = (3, [""])
min
is still 0, but y
2's endIndex
is 3 and x
2's endIndex
is also 3 (remember that this time x
is the y
state from the previous iteration), the closure simply returns failure
.
z
2 is failure
, skip this stepc
(x
2), which is c((3, ["abc"]))
in this iteration.c
simply returns a valid MatchResult here as we're at the end of the pattern. Which means that d
1 returns this result, and the first iteration returns passes it along from step 10.
Basically, as you can see, the spec line which causes the JS behavior to be different than PCRE's is the following one:
a. If
min
is zero andy
'sendIndex
is equal tox
'sendIndex
, returnfailure
.
When combined with:
- Call
c(x)
and return its result.
Which returns the previously captured values if the iteration fails.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With