Why does a lookahead in an optional 0-width capture group prevent the group from matching?

Consider the following regex:

(^.)?

This matches a single character at the start of the string, if possible:

>> 'ab'.match(/(^.)?/)
Array [ "a", "a" ]

However, wrapping the . in a lookahead causes it to stop working:

>> 'ab'.match(/(^(?=.))?/)
Array [ "", undefined ]

The value of undefined indicates that the group didn't match, rather than having matched an empty string. But I don't understand how the lookahead prevents the group from matching. I would have expected to get a result of ["", ""] here.

Even more curiously, this is only the case if the surrounding capture group has a width of 0. If we change the ^ anchor to something longer, it works correctly again:

>> 'ab'.match(/(a(?=.))?/)
Array [ "a", "a" ]

Removing the ? that makes the group optional fixes the output as well:

>> 'ab'.match(/(^(?=.))/)
Array [ "", "" ]

Can someone explain why this happens? It doesn't make any sense to me.

Aran-Fey asked Apr 28 '18 07:04


1 Answer

This doesn’t need to involve lookaheads. Any capture group that is itself optional and whose contents end up matching the empty string is reported as not having matched.

> /()/.exec('foo')
['', '']

> /()?/.exec('foo')
['', undefined]

It’s pretty weird, yep. With a lazy quantifier inside, making the group optional even changes what it captures:

> /(.*?)/.exec('foo')
['', '']

> /(.*?)?/.exec('foo')
['f', 'f']
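
As far as I can tell, the same thing happens with any quantifier whose minimum is zero, not just ?. Here is a quick sketch to check that in a console. The helper name groupInfo is made up for illustration, and every pattern below is chosen so that exec never returns null.

```javascript
// Hypothetical helper (not from the question): reports the whole match,
// group 1, and whether group 1 participated in the match at all.
function groupInfo(regex, input) {
  const m = regex.exec(input); // assumed non-null for the patterns below
  return { match: m[0], group: m[1], participated: m[1] !== undefined };
}

const results = {
  // Zero-minimum quantifiers around a group that can only match emptily.
  star: groupInfo(/()*/, 'foo'),
  braces: groupInfo(/(){0,1}/, 'foo'),
  nested: groupInfo(/(a*)*/, 'b'),
  // The group consumes a character here, so it participates normally.
  consumed: groupInfo(/(a*)?/, 'ab'),
  // Quantifier moved inside the capture group: the group itself is mandatory.
  inner: groupInfo(/((?:^(?=.))?)/, 'ab'),
};

console.log(results);
```

The first three should report participated: false, consumed should capture "a", and inner should capture an empty string. So if you need ["", ""] from the question's pattern, putting the ? inside the capture group seems to be one way to get it.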

There’s a V8 test case that suggests the behaviour is expected. This step of the spec’s RepeatMatcher algorithm

If min is zero and y's endIndex is equal to x's endIndex, return failure.

seems relevant but is really hard to understand. If it’s actually causing the behaviour here (while trying to avoid having a group match consecutive empty strings?), I’d consider it a spec bug. Other languages don’t behave the same. (Not that they have to, but it’s another strike.)
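
To convince myself that this one step is enough to produce the output, here is a toy continuation-passing matcher. It is only a rough sketch in the spirit of the spec's algorithm, not the spec text and not how V8 implements it. All names (empty, anyChar, lazyStar, group1, optional, run) are invented for the example, and it only handles capture group 1 with a match attempt at index 0.

```javascript
// A state is { end, caps }; a matcher takes (str, state, continuation)
// and returns a final state or FAIL. All names here are illustrative.
const FAIL = null;

// Matches the empty string, like the body of `()`.
const empty = (str, x, c) => c(x);

// Matches any one character, like `.` (ignoring line terminators).
const anyChar = (str, x, c) =>
  x.end < str.length ? c({ end: x.end + 1, caps: x.caps }) : FAIL;

// Lazy `*`: try to stop first, and only then consume one more item.
const lazyStar = (m) => function self(str, x, c) {
  const z = c(x);
  if (z !== FAIL) return z;
  return m(str, x, (y) => (y.end === x.end ? FAIL : self(str, y, c)));
};

// Capture group 1 around matcher m.
const group1 = (m) => (str, x, c) =>
  m(str, x, (y) => {
    const caps = y.caps.slice();
    caps[1] = str.slice(x.end, y.end);
    return c({ end: y.end, caps });
  });

// Greedy `?` around a matcher that contains group 1.
// emptyCheck = true applies the quoted step; false skips it.
const optional = (m, emptyCheck) => (str, x, c) => {
  const caps = x.caps.slice();
  caps[1] = undefined; // captures inside the quantified atom are reset
  const z = m(str, { end: x.end, caps }, (y) => {
    if (emptyCheck && y.end === x.end) return FAIL; // the quoted step
    return c(y);
  });
  return z !== FAIL ? z : c(x); // fall back to skipping the atom
};

// Runs a matcher at index 0 and returns [whole match, group 1].
function run(matcher, str) {
  const r = matcher(str, { end: 0, caps: [] }, (y) => y);
  return r === FAIL ? null : [str.slice(0, r.end), r.caps[1]];
}

const withCheck = run(optional(group1(lazyStar(anyChar)), true), 'foo');
const withoutCheck = run(optional(group1(lazyStar(anyChar)), false), 'foo');
const emptyGroup = run(optional(group1(empty), true), 'foo');
console.log(withCheck, withoutCheck, emptyGroup);
```

With the check enabled, the model gives ['f', 'f'] for (.*?)? and ['', undefined] for ()?, the same as the JavaScript results above. With the check disabled, (.*?)? gives ['', ''], which is what the other languages below return.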

Actually, the behaviour has been described before with a comment about being explained in the spec, but it’s really not explained at all. (There’s an (a*)* note with no corresponding output, plus the aforequoted step which is offered without justification except in some other notes about the problem of repeating empty matches which, again, everyone else seems to have solved in the more intuitive way.)

Python

>>> re.match(r'(.*?)?', 'foo').group(0, 1)
('', '')

.NET

> Dim m = Regex.Match("foo", "(.*?)?")
> m.Success
True
> m.Length
0

Ruby

> 'foo' =~ /(.*?)?/
0
> $1
""

Perl

> 'foo' =~ /(.*?)?/
('')
answered Oct 23 '22 17:10 (8 revs)