Consider this regex.
a*b
This will fail in case of aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaac
This takes 67
steps in debugger to fail.
Now consider this regex.
(?>a*)b
This will fail in case of aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaac
This takes 133
steps in debugger to fail.
And lastly this regex:
a*+b (a variant of atomic group)
This will fail in case of aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaac
This takes 67
steps in debugger to fail.
When I check the benchmark atomic group (?>a*)b
performs 179%
faster.
Now atomic groups disable backtracking. So performance in match is good.
But why are the number of steps more? Can somebody explain on this?
Why is there a diff. in steps between two atomic groups (?>a*)b
and a*+b
.
Do they work differently?
Author note:
This answer targets question 1 as delivered by the bounty text "I am looking forward to the exact reason why more steps are being needed by the debugger.I dont need answers explaining how atomic groups work.";
Jerry's answer addresses the other concerns very well, while my other answer takes a ride through the mentioned constructs, how they work, and why they are important. For full knowledge, simply reading this post is not enough!
WHAT?!
Yeah, I'm serious, read on...
Firstly, I would like to present you with quantified non-capturing groups, over without the group:
Pattern 1: (?:c)at
Pattern 2: cat
So what exactly happens here? We'll match the patterns with the test string "concat"
on a regex engine with optimizations disabled:
While we're at it, I present you some more groups:
Oh no! I'm going to avoid using groups!
But wait! Please note that the number of steps taken to match has no correlation with the performance of the match. pcre engines optimizes away most of the "unnecessary steps" as I've mentioned. Atomic groups are still the most efficient, despite more steps taken on an engine with optimizations disabled.
Perhaps relevant:
I don't know how you did your benchmarking but both a*+b
and (?>a*)b
should be the same. To quote regular-expressions.info (emphasis mine):
Basically, instead of
X*+
, write(?>X*)
. It is important to notice that both the quantified tokenX
and the quantifier are inside the atomic group. Even ifX
is a group, you still need to put an extra atomic group around it to achieve the same effect.(?:a|b)*+
is equivalent to(?>(?:a|b)*)
but not to(?>a|b)*
. The latter is a valid regular expression, but it won't have the same effect when used as part of a larger regular expression.
And just to confirm the above, I ran the following on ideone:
$tests = 1000000;
$start = microtime( TRUE );
for( $i = 1; $i <= $tests; $i += 1 ) {
preg_match('/a*b/','aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaac');
}
$stop = microtime( TRUE );
printf( "For /a*b/ : %1.15f per iteration for %s iterations\n", ($stop - $start)/$tests, $tests );
unset( $stop, $start );
$start = microtime( TRUE );
for( $i = 1; $i <= $tests; $i += 1 ) {
preg_match('/(?>a*)b/','aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaac');
}
$stop = microtime( TRUE );
printf( "For /(?>a*)b/: %1.15f per iteration for %s iterations\n", ($stop - $start)/$tests, $tests );
unset( $stop, $start );
$start = microtime( TRUE );
for( $i = 1; $i <= $tests; $i += 1 ) {
preg_match('/a*+b/','aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaac');
}
$stop = microtime( TRUE );
printf( "For /a*+b/ : %1.15f per iteration for %s iterations\n", ($stop - $start)/$tests, $tests );
unset( $stop, $start );
$start = microtime( TRUE );
for( $i = 1; $i <= $tests; $i += 1 ) {
preg_match('/(?>a)*b/','aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaac');
}
$stop = microtime( TRUE );
printf( "For /(?>a)*b/: %1.15f per iteration for %s iterations\n", ($stop - $start)/$tests, $tests );
unset( $stop, $start );
Getting this as the output:
For /a*b/ : 0.000000879034996 per iteration for 1000000 iterations
For /(?>a*)b/: 0.000000876362085 per iteration for 1000000 iterations
For /a*+b/ : 0.000000880002022 per iteration for 1000000 iterations
For /(?>a)*b/: 0.000000883045912 per iteration for 1000000 iterations
Now, I am by no means a PHP expert so I don't know if this is the right way to benchmark stuff in there, but that's they all have about the same performance, which is kind of expected given the simplicity of the task.
Still, couple of things I note from the above:
Neither (?>a)*b
nor (?>a*)b
are 179% faster than another regex; the above all are within 7% from each other.
But why are the number of steps more? Can somebody explain on this?
It is to be noted that the number of steps is not a direct representation of the performance of a regex. It is a factor, but not the ultimate determinant. There are more steps because the breakdown has steps before entering the group, and after entering the group...
1 / (?> a* ) b/x aaaaaaaaaaaaaaaaaaaa...
^
2 / (?> a* ) b/x aaaaaaaaaaaaaaaaaaaa...
^^^^^^^^
3 / (?> a* ) b/x aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaac
^^
4 / (?> a* ) b/x aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaac
^
5 / (?> a* ) b/x aaaaaaaaaaaaaaaaaaaaa...
^
That's 2 more steps because of the group than the 3 steps...
1 / a*+ b /x aaaaaaaaaaaaaaaaaaaa...
^
2 / a*+ b /x aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaac
^^^
3 / a*+ b /x aaaaaaaaaaaaaaaaaaaaa...
^
Where you can say that a*b
is the same as (?:a*)b
but the latter having more steps:
1 / (?: a* ) b/x aaaaaaaaaaaaaaaaaaaa...
^
2 / (?: a* ) b/x aaaaaaaaaaaaaaaaaaaa...
^^^^^^^^
3 / (?: a* ) b/x aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaac
^^
4 / (?: a* ) b/x aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaac
^
5 / (?: a* ) b/x aaaaaaaaaaaaaaaaaaaaa...
Note: Even there, you see that regex101 has the steps optimised a bit for the number of steps in a*b
.
Possessive quantifiers and atomic groups work differently depending on how you use them. If I take regular-expression's example and tweaking it a little:
(?>a|b)*ac
match aabaac
But
(?:a|b)*+ac
and (?>(?:a|b)*)ac
don't match aabaac
.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With