Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Is the order of a preg_offset_capture sub-array *guaranteed* to be increasing offset?

Tags:

arrays

regex

php

I haven't been able to find an authoritative answer to this, though I'm 99.9% certain it's true. Things like the accepted answer to rely on it's being true, as I expect lots of other code does. But can someone who really knows something about preg_match_all (not by observation but by specified requirement or specified algorithm) confirm that this is guaranteed behavior? I can't glean it from the documentation.

My use case is very simple:

preg_match_all("/$regexp/", $content, $matches, PREG_OFFSET_CAPTURE);

And I know that $regexp does not contain any sub-patterns, so the documentation tells me that $matches[0] will be an array of 2-element arrays, where each sub-array has elements with numeric key 0 containing a string that matched the pattern, and numeric key 1 containing the offset into $content at which the match occurred. And while it only seems reasonable that the array elements would be ordered by increasing offset, I don't see where that is required, such that it would be a bug if it weren't the case. Although I can't imagine just how it could be done to useful effect, perhaps there might be some way to implement preg_match_all with multiple threads that append their partial results without merging into fully-sorted order.

In my particular case, I only care about offsets, not the strings that were matched, but it's critical that the offsets be increasing. So with belt-and-suspenders mentality I coded:

preg_match_all("/$regexp/", $content, $matches, PREG_OFFSET_CAPTURE);
$offsets = array();
foreach ($matches as $match) {
    $offsets[] = $match[1];
}
sort($offsets);

So put another way, is the final sort($offsets) a guaranteed waste of cycles?

And if it won't get me in deep trouble for asking a related but potentially separate question, if the sort were potentially useful, would it be more/less/same efficient to take the default SORT_REGULAR flag as shown, or to specify SORT_NUMERIC explicitly, given that the offsets produced within preg_match_all are necessarily numeric?

like image 335
sootsnoot Avatar asked Jun 12 '14 17:06

sootsnoot


1 Answers

Regarding your question on string offset order:

Full matches should always be in ascending string offset order. PHP implements global matching with a loop that sets the start_offset at the end of the most recent full match until the end of the subject string. That is, it finds the first match, then the second, then the third, and so on.

If you'd like to verify that I'm not horribly misreading the source code (or missing something significant), you can look at the function php_pcre_match_impl in ext/pcre/php_pcre.c. preg_match_all sets the global parameter to 1. What clued me into it was a comment at the end of the do while loop for global:

/*Advance to the position right after the last full match*/
start_offset = offsets[1];

If global is set, then the loop repeats with the new offset and pcre_exec is called again with it.

Regarding your SORT_NUMERIC question:

It's hard to say. Setting SORT_NUMERIC makes sort use the numeric_compare_function to do element comparisons, where SORT_REGULAR uses compare_function.

compare_function does a type check and then decides what to do for the comparison from there, whereas numeric_compare_function just blindly converts both into doubles. With both being LONGs compare_function just compares them without doing any kind of conversion. So it would ultimately depend on which is faster: blindly converting to double, or performing the type checking.

like image 127
EPB Avatar answered Sep 20 '22 11:09

EPB