Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Atomic groups clarity

Consider this regex.

a*b

This will fail in case of aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaac

This takes 67 steps in debugger to fail.

Now consider this regex.

(?>a*)b

This will fail in case of aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaac

This takes 133 steps in debugger to fail.

And lastly this regex:

a*+b  (a variant of atomic group)

This will fail in case of aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaac

This takes 67 steps in debugger to fail.

When I check the benchmark atomic group (?>a*)b performs 179% faster.

Now atomic groups disable backtracking. So performance in match is good.

  1. But why are the number of steps more? Can somebody explain on this?

  2. Why is there a diff. in steps between two atomic groups (?>a*)b and a*+b.

Do they work differently?

like image 452
vks Avatar asked Sep 29 '14 06:09

vks


2 Answers

Author note:

    This answer targets question 1 as delivered by the bounty text "I am looking forward to the exact reason why more steps are being needed by the debugger.I dont need answers explaining how atomic groups work.";
    Jerry's answer addresses the other concerns very well, while my other answer takes a ride through the mentioned constructs, how they work, and why they are important. For full knowledge, simply reading this post is not enough!

Every group in a regular expression takes a step to step into and out of the group.

    WHAT?!
Yeah, I'm serious, read on...

Firstly, I would like to present you with quantified non-capturing groups, over without the group:

Pattern 1: (?:c)at
Pattern 2: cat

So what exactly happens here? We'll match the patterns with the test string "concat" on a regex engine with optimizations disabled:

(?:c)atcat

While we're at it, I present you some more groups:

groups

    Oh no! I'm going to avoid using groups!

But wait! Please note that the number of steps taken to match has no correlation with the performance of the match. pcre engines optimizes away most of the "unnecessary steps" as I've mentioned. Atomic groups are still the most efficient, despite more steps taken on an engine with optimizations disabled.

Perhaps relevant:

  • Why is a character class faster than alternation?
like image 63
Unihedron Avatar answered Oct 23 '22 09:10

Unihedron


I think something is wrong...

I don't know how you did your benchmarking but both a*+b and (?>a*)b should be the same. To quote regular-expressions.info (emphasis mine):

Basically, instead of X*+, write (?>X*). It is important to notice that both the quantified token X and the quantifier are inside the atomic group. Even if X is a group, you still need to put an extra atomic group around it to achieve the same effect. (?:a|b)*+ is equivalent to (?>(?:a|b)*) but not to  (?>a|b)*. The latter is a valid regular expression, but it won't have the same effect when used as part of a larger regular expression.

And just to confirm the above, I ran the following on ideone:

$tests = 1000000;

$start = microtime( TRUE );
for( $i = 1; $i <= $tests; $i += 1 ) {
    preg_match('/a*b/','aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaac');
}
$stop = microtime( TRUE );

printf( "For /a*b/    : %1.15f per iteration for %s iterations\n", ($stop - $start)/$tests, $tests );
unset( $stop, $start );

$start = microtime( TRUE );
for( $i = 1; $i <= $tests; $i += 1 ) {
    preg_match('/(?>a*)b/','aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaac');
}
$stop = microtime( TRUE );

printf( "For /(?>a*)b/: %1.15f per iteration for %s iterations\n", ($stop - $start)/$tests, $tests );
unset( $stop, $start );

$start = microtime( TRUE );
for( $i = 1; $i <= $tests; $i += 1 ) {
    preg_match('/a*+b/','aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaac');
}
$stop = microtime( TRUE );

printf( "For /a*+b/   : %1.15f per iteration for %s iterations\n", ($stop - $start)/$tests, $tests );
unset( $stop, $start );

$start = microtime( TRUE );
for( $i = 1; $i <= $tests; $i += 1 ) {
    preg_match('/(?>a)*b/','aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaac');
}
$stop = microtime( TRUE );

printf( "For /(?>a)*b/: %1.15f per iteration for %s iterations\n", ($stop - $start)/$tests, $tests );
unset( $stop, $start );

Getting this as the output:

For /a*b/    : 0.000000879034996 per iteration for 1000000 iterations
For /(?>a*)b/: 0.000000876362085 per iteration for 1000000 iterations
For /a*+b/   : 0.000000880002022 per iteration for 1000000 iterations
For /(?>a)*b/: 0.000000883045912 per iteration for 1000000 iterations

Now, I am by no means a PHP expert so I don't know if this is the right way to benchmark stuff in there, but that's they all have about the same performance, which is kind of expected given the simplicity of the task.

Still, couple of things I note from the above:

Neither (?>a)*b nor (?>a*)b are 179% faster than another regex; the above all are within 7% from each other.

Back to the actual question

But why are the number of steps more? Can somebody explain on this?

It is to be noted that the number of steps is not a direct representation of the performance of a regex. It is a factor, but not the ultimate determinant. There are more steps because the breakdown has steps before entering the group, and after entering the group...

1   / (?> a* ) b/x    aaaaaaaaaaaaaaaaaaaa...
     ^
2   / (?> a* ) b/x    aaaaaaaaaaaaaaaaaaaa...
      ^^^^^^^^
3   / (?> a* ) b/x    aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaac
          ^^
4   / (?> a* ) b/x    aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaac
            ^
5   / (?> a* ) b/x    aaaaaaaaaaaaaaaaaaaaa...
               ^

That's 2 more steps because of the group than the 3 steps...

1   / a*+ b /x    aaaaaaaaaaaaaaaaaaaa...
     ^
2   / a*+ b /x    aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaac
      ^^^
3   / a*+ b /x    aaaaaaaaaaaaaaaaaaaaa...
          ^

Where you can say that a*b is the same as (?:a*)b but the latter having more steps:

1   / (?: a* ) b/x    aaaaaaaaaaaaaaaaaaaa...
     ^
2   / (?: a* ) b/x    aaaaaaaaaaaaaaaaaaaa...
      ^^^^^^^^
3   / (?: a* ) b/x    aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaac
          ^^
4   / (?: a* ) b/x    aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaac
            ^
5   / (?: a* ) b/x    aaaaaaaaaaaaaaaaaaaaa...

Note: Even there, you see that regex101 has the steps optimised a bit for the number of steps in a*b.


Conclusion

Possessive quantifiers and atomic groups work differently depending on how you use them. If I take regular-expression's example and tweaking it a little:

(?>a|b)*ac match aabaac

But

(?:a|b)*+ac and (?>(?:a|b)*)ac don't match aabaac.

like image 35
Jerry Avatar answered Oct 23 '22 08:10

Jerry