I haven't been able to find an authoritative answer to this, though I'm 99.9% certain it's true. Things like the accepted answer to rely on it's being true, as I expect lots of other code does. But can someone who really knows something about preg_match_all
(not by observation but by specified requirement or specified algorithm) confirm that this is guaranteed behavior? I can't glean it from the documentation.
My use case is very simple:
preg_match_all("/$regexp/", $content, $matches, PREG_OFFSET_CAPTURE);
And I know that $regexp
does not contain any sub-patterns, so the documentation tells me that $matches[0]
will be an array of 2-element arrays, where each sub-array has elements with numeric key 0 containing a string that matched the pattern, and numeric key 1 containing the offset into $content
at which the match occurred. And while it only seems reasonable that the array elements would be ordered by increasing offset, I don't see where that is required, such that it would be a bug if it weren't the case. Although I can't imagine just how it could be done to useful effect, perhaps there might be some way to implement preg_match_all
with multiple threads that append their partial results without merging into fully-sorted order.
In my particular case, I only care about offsets, not the strings that were matched, but it's critical that the offsets be increasing. So with belt-and-suspenders mentality I coded:
preg_match_all("/$regexp/", $content, $matches, PREG_OFFSET_CAPTURE);
$offsets = array();
foreach ($matches as $match) {
$offsets[] = $match[1];
}
sort($offsets);
So put another way, is the final sort($offsets)
a guaranteed waste of cycles?
And if it won't get me in deep trouble for asking a related but potentially separate question, if the sort were potentially useful, would it be more/less/same efficient to take the default SORT_REGULAR
flag as shown, or to specify SORT_NUMERIC
explicitly, given that the offsets produced within preg_match_all
are necessarily numeric?
Regarding your question on string offset order:
Full matches should always be in ascending string offset order. PHP implements global matching with a loop that sets the start_offset
at the end of the most recent full match until the end of the subject string. That is, it finds the first match, then the second, then
the third, and so on.
If you'd like to verify that I'm not horribly misreading the source code (or missing something significant), you can look at the function
php_pcre_match_impl
in ext/pcre/php_pcre.c
. preg_match_all
sets the global parameter to 1. What clued me into it was a comment at the end
of the do while loop for global
:
/*Advance to the position right after the last full match*/
start_offset = offsets[1];
If global is set, then the loop repeats with the new offset and pcre_exec
is called again with it.
Regarding your SORT_NUMERIC question:
It's hard to say. Setting SORT_NUMERIC
makes sort use the numeric_compare_function
to do element comparisons, where SORT_REGULAR
uses compare_function
.
compare_function
does a type check and then decides what to do for the comparison from there, whereas numeric_compare_function
just blindly converts both into
double
s. With both being LONGs compare_function
just compares them without doing any kind of conversion. So it would ultimately depend on which is faster: blindly converting to double, or performing the type checking.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With