Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why do these three regexes have different step counts?

Tags:

regex

pcre

In an answer to a recent question, I contrived a couple of clever little regexes (at the asker's request) to match a substring at either the beginning or the end of a string. When run on Regex101, however, I noted that the different patterns have different step counts (indicating that the regex engine has to do more work for one vs. the other). To my mind, however, there is no intuitive reason that this should be so.

The three patterns are as follows:

  • Fun with conditionals: /(^)?!next(?(1)|$)/ (demo - 86 steps)
  • Classic alternation: ^!next|!next$ (demo - 58 steps)
  • Nasty lookarounds: !next(?:(?<=^.{5})|(?=$)) (demo - 35 steps)

Why is the first pattern so much less efficient than the second, and, most confusingly, why is the third so efficient?

like image 719
Sebastian Lenartowicz Avatar asked Oct 08 '16 05:10

Sebastian Lenartowicz


1 Answers

TL;DR

Why is the first pattern so much less efficient than the second, and, most confusingly, why is the third so efficient?

Because first two are anchored, third is not.

Real story, how steps are taken

Considering this regex /^x/gm, how many steps do you think engine will take to return a "no match" if subject string is abc? You are right, two.

  1. Assert beginning of string
  2. Match x

Then overall match fails since no x immediately comes after beginning of string assertion.

Well I lied. It’s not that I’m nasty, it just makes it easier to understand things that are going to happen. According to regex101.com it takes no steps at all:

enter image description here

Shall you believe it this time? Yes. No. Let's see.

PCRE start-up optimizations

PCRE, being kind to its users, provides some features to speed up things that is called start-up optimization. It does some dependent optimizations in according to Regular Expressions being used.

One important feature of these optimizations is a pre-scan of subject string in order to ensure that:

  • subject string contains at least one character that corresponds to the first character of match or
  • if a known starting point exists.

If one is not found matching function never runs.

Saying that, if our regex is /x/ and our subject string is abc then with start-up optimization enabled, a pre-scan is intended to be done to look for x, if is not found whole match fails or more better it doesn't even bother to go through matching process.

So how do these information help?

Let's flashback to our first example and change our regex a little bit. From:

/^x/gm

to

/^x/g

The difference is m flag that is getting unset. For those who don't know what m flag does if is set:

It changes the meaning of ^ and $ symbols in the sense that they no more mean start and end of string but start and end of line.

Now what if we run this regex /^x/g over our subject string abc? Should we expect a difference in steps engine takes or not? Absolutely, yes. Let's look at regex101.com returned info:

enter image description here

I really encourage you to believe it this time. It's actual.

What's happening?

Well, it seems a little confusing but we are going to enlighten things up. When there is no m modifier set, pre-scan looks to assert start of string (a known starting point). If assertion passes then actual matching function runs otherwise "no match" will return.

But wait... every subject string definitely has one and only start of string position and it's always at the very beginning of it. So wouldn't be a pre-scan obviously unnecessary? Yes, engine doesn't do a pre-scan here. With /^x/g it immediately asserts start of string and then fails like so (since it matches at ^ it goes through actual matching process). That's why we see regex101.com shows number of steps are 2.

But... with setting m modifier things differ. Now meaning of both ^ and $ anchors are changed. With ^ matching start of line, assertion of the same position in subject string abc happens but next immediate character is not x, being within actual matching process and since g flag is on, next match starts at position before b and fails and this trial and error continues up to end of subject string.

enter image description here

Debugger shows 6 steps but main page says 0 steps, why?

I'm not sure about latter but for the sake of debugging, regex101 debugger runs with (*NO_START_OPT) so 6 steps are true only if this verb is set. And I said I'm not sure about latter because all anchored patterns prevent a further pre-scan optimization and we should know what can be called an anchored pattern:

A pattern is automatically anchored by PCRE if all of its top-level alternatives begin with one of the following:

  • ^ unless PCRE_MULTILINE is set
  • \A always
  • \G always
  • .* if PCRE_DOTALL is set and there are no back references to the subpattern in which .* appears

Now you got entirely what I was talking about when I was saying no pre-scan happens while m flag is not set in /^x/g: It's considered an anchored pattern which disables pre-scan optimization. So when m flag is on, this is no more an anchored pattern: /^x/gm hence pre-scan optimization could take place.

enter image description here

Engine knows start of string anchor \A (or ^ while multiline mode is disable) occurs only once when is matched so it doesn't continue at the next position.

Back to your own RegExes

First two are anchored (^ in conjunction with m flag), third is not. That is, third regex benefits from a pre-scan optimization. You can believe in 35 steps since an optimization caused it. But if you disable start-up optimization:

(*NO_START_OPT)!next(?:(?<=^.{5})|(?=$))

You will see 57 steps which is mostly the same as number of debugger steps.

like image 164
revo Avatar answered Oct 17 '22 09:10

revo