Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why are lookbehind assertions not supported in Javascript?

Tags:

Recently I realized (by some embarrassment) that regex lookbehind assertions were not possible in Javascript.

What is the (factual) reason for the absence for this assertion so seemingly common?

I realize there are alternate ways to achieve the same thing perhaps, although Is it the basic semantics at work which forbid the functionality, or what exactly?

It also seems that some regex testing tools out there that generate Javascript code from regex patterns seem to ignore this fact — which strikes me as a bit odd.

like image 744
l'L'l Avatar asked May 08 '15 08:05

l'L'l


People also ask

Does JavaScript support Lookbehind?

The positive lookbehind ( (? <= ) ) and negative lookbehind ( (? <! ) ) zero-width assertions in JavaScript regular expressions can be used to ensure a pattern is preceded by another pattern.

Does JavaScript support negative Lookbehind?

Since 2018, Lookbehind Assertions are part of the ECMAScript language specification. As Javascript supports negative lookahead, one way to do it is: reverse the input string. match with a reversed regex.

What is Lookbehind assertion?

Regex Lookbehind is used as an assertion in Python regular expressions(re) to determine success or failure whether the pattern is behind i.e to the right of the parser's current position. They don't match anything. Hence, Regex Lookbehind and lookahead are termed as a zero-width assertion.

Does Safari support negative Lookbehind?

Please Note: Lookbehind is not supported in non-V8 browsers, such as Safari, Internet Explorer.


2 Answers

Today

Lookbehind is now an official part of the ES 2018 specification. Axel Rauschmayer gives a good introduction in his blog post.

History

It looks like at the time, Brendan Eich wasn't aware of its existence (because Netscape was built on an older version of Perl):

This was 1998, Netscape 4 work I did in '97 was based on Perl 4(!), but we proposed to ECMA TC39 TG1 (the JS group -- things were different then, including capitalization) something based on Perl 5. We didn't get everything, and we had to rationalize some obvious quirks.

I don't remember lookbehind (which emerged in Perl 5.005 in July '98) being left out on purpose. Waldemar may recall more, I'd handed him the JS keys inside netscape.com to go do mozilla.org.

If you are game to write a proposal or mini-spec (in the style of ES5 even), let me know. I'll chat with other TC39'ers next week about this.

/be

There have been a bunch of different on the mailing list with attempts to include it, but it still seems to be a pretty complex feature performance-wise, because EcmaScript Regular Expressions are backtracking based and backtracking is needed in lookbehind when working with capturing groups. This can lead to problems such as catastrophic backtracking when used incorrectly.

At some point it was suggested for ES6/Es 2015, but it never made the draft, let alone the specification. In the last post in the discussion, it seems that nobody took up the task of implementing it. If anybody feels called to write an implementation, they can sign up for the ES Discuss list and propose it.

Update May 2015:

In May 2015, Nozomu Katō has proposed an ES7 look-behind implementation.

Update September 2015:

Regex Look-behind was added as a stage 0 proposal.

Update May 2017:

The proposal is now at stage 3. This means that now at least two browsers need to implement it for it to become a part of the next EcmaScript standard. As @martixy mentioned in the comments, Chrome has implemented it behind the JS experimental flag.

like image 141
nils Avatar answered Oct 02 '22 17:10

nils


To speak from the conclusion, I think look-behind is not implemented in JavaScript, since no one has any idea how it should behave, and existing implementations show that adding support for look-behind is rather complex.

JavaScript/ECMAScript is different from other languages in the sense that the specification includes an abstract implementation of the regex engine, while most other language only stops short at description of behavior of each pieces of regex syntax, with scant description of how different tokens interacts with each other.

Look-ahead? Easy to implement

The implementation of look-ahead is quite straight-forward. You only need to treat the pattern inside the look-ahead in the same manner as pattern outside look-ahead, and perform a left-to-right match as per usual, except that after the look-ahead succeeds 1) the current position is restored to before entering the look-ahead, and 2) choice points inside look-ahead are discarded after it is matched.

There is no limit to what can be included inside look-ahead, since it is a very simple extension to the existing natural left-to-right matching facilities.

Look-behind? Not so easy

On the other hand, the implementation of look-behind is not as straight forward.

Imagine how you would implement the following look-behind construct:

(?<=fixed-string) (?<=a|fixed|string) (?<=t[abc]{1,3}) (?<=(abc){2,6}) (?<=^.*abc.*) (?<=\G"[^"]+"); (?<=^(.....|.......)+) \b(\w+)\b(?<!\b\1\b.*\1) 

Apart from the basic case (?<=fixed-string), which any look-behind implementation must support, (?<=a|fixed|string) is a much desirable case to support.

Different regex engine has varied level of support for the regex above.

Let us look at how they are implemented in .NET and Java. (This is the two flavors whose look-behind behavior I have studied.)

.NET implementation

In Microsoft .NET implementation, all those regex above are valid, since .NET implements look-behind by using right-to-left mode, with the starting offset at the current position. The look-behind construct doesn't generate any choice point by itself.

However, if you use capturing groups inside the look-behind, it starts to get confusing, since the atoms in the patterns are interpreted from right-to-left, as demonstrated in this post. This is the disadvantage of this method: you would need to wrap your mind to think right-to-left when writing a look-behind.

Java implementation

In contrast, Java regex implementation implements look-behind by reusing the left-to-right matching facilities.

It first analyzes the pattern inside the look-behind for the minimum and maximum length of the pattern. Then, look-behind is implemented by trying to match the pattern inside from left-to-right, starting from (current position - minimum length) to (current position - maximum length).

Is there anything missing? Yes! Since we are matching from left-to-right, we need to make sure that the match ends right at the position before entering the look-behind (current position). In Java, this is implemented by appending a node at the end of the pattern inside look-behind.

This implementation is very inefficient, as there are maximum - minimum + 1 choice points created in the look-behind itself, before we even talk about choice points created by the pattern inside the look-behind.

The look-behind bound check is also inefficient, since it is placed at the end of the pattern, and can't prune choice points that are clearly hopeless (those already far exceeding the current position in the middle of pattern).

Summary

As you can see, adding support for look-behind is not easy:

  • The right-to-left approach seems reasonably efficient. However, it requires additional specification on the right-to-left matching behavior on other existing constructs.
  • The approach to reuse left-to-right matching facilities is complex to specify, and is very inefficient. It also requires the introduction of pattern analysis into the specification, lest the performance is thrown out of the window.

(Note that I have not yet covered the behavior when look-behind is used inside look-ahead, and vice-versa. This should also be taken into consideration when defining semantics for look-behind construct).

These technical hurdles are also mentioned by Waldemar Horwat (who wrote the ES3 regex spec) in the mail cited in nils' answer:

No one has yet submitted a well-defined proposal for lookbehinds on the table. Lookbehinds are difficult to translate into the language used by the spec and get quite fuzzy when the order of evaluation of parts of the regexp matters, which is what happens if capturing parentheses are involved. Where do you start looking for the lookbehind? Shortest first, longest first, or reverse string match? Greedy or not? Backtrack into capturing results?

like image 43
nhahtdh Avatar answered Oct 02 '22 17:10

nhahtdh