I wrote a pretty simple preg_match_all script in PHP:
$fileName = 'A_DATED_FILE_091410.txt';
$matches = array();
preg_match_all('/[0-9][0-9]/',$fileName,$matches);
print_r($matches);
My Expected Output:
$matches = array(
[0] => array(
[0] => 09,
[1] => 91,
[2] => 14,
[3] => 41,
[4] => 10
)
)
What I got instead:
$matches = array(
[0] => array(
[0] => 09,
[1] => 14,
[2] => 10
)
)
Now, in this particular use case this was preferable, but I'm wondering why it didn't match the other substrings? Also, is a regex possible that would give me my expected output, and if so, what is it?
With a global regex (which is what preg_match_all uses), once a match is made, the regex engine continues searching the string from the end of the previous match.
In your case, the regular expression engine starts at the beginning of the string and advances until the 0, since that is the first character that matches [0-9]. It then advances to the next position (9), and since that matches the second [0-9], it takes 09 as a match. When the engine continues matching (since it has not yet reached the end of the string), it resumes at the position right after the match (the 1), and the above repeats. The 9 has already been consumed, so it is never tried as the start of a match, which is why 91 and 41 do not appear.
See also: First Look at How a Regex Engine Works Internally
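To see this for yourself, here is a small sketch that asks preg_match_all for the offset of each match via the PREG_OFFSET_CAPTURE flag. The variable name $starts is just something I made up for the illustration.

```php
<?php
$fileName = 'A_DATED_FILE_091410.txt';

// With PREG_OFFSET_CAPTURE, each match is a [text, byteOffset] pair.
preg_match_all('/[0-9][0-9]/', $fileName, $matches, PREG_OFFSET_CAPTURE);

$starts = array();
foreach ($matches[0] as $pair) {
    list($text, $start) = $pair;
    $starts[] = $start;
    echo $text . ' starts at offset ' . $start . "\n";
}
```

The offsets are two apart, so every match begins exactly where the previous one ended and no digit is used twice.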
If you must get every 2-digit sequence, you can use preg_match and use offsets to determine where to start matching from:
$fileName = 'A_DATED_FILE_091410.txt';
$allSequences = array();
$matches = array();
$offset = 0;
while (preg_match('/[0-9][0-9]/', $fileName, $matches, PREG_OFFSET_CAPTURE, $offset))
{
    list($match, $offset) = $matches[0];
    $allSequences[] = $match;
    $offset++; // the match is 2 digits long, so resume one character after its start
}
Note that the offset returned with the PREG_OFFSET_CAPTURE flag is the start of the match.
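If you need this in more than one place, the same idea could be wrapped in a function. This is only a sketch: findOverlapping is a hypothetical name (not a PHP built-in), and it assumes a plain ASCII subject, because the offsets are byte offsets and stepping by one byte could land inside a multibyte character.

```php
<?php
// Hypothetical helper (not part of PHP): collect every match of $pattern,
// including overlapping ones, by restarting one byte after each match start.
// Assumes a single-byte (ASCII) subject.
function findOverlapping($pattern, $subject)
{
    $found = array();
    $offset = 0;
    // preg_match returns 1 on a match, 0 on no match, false on error.
    while (preg_match($pattern, $subject, $m, PREG_OFFSET_CAPTURE, $offset) === 1) {
        list($text, $start) = $m[0];
        $found[] = $text;
        $offset = $start + 1; // step past the match start, not past the whole match
    }
    return $found;
}

$pairs = findOverlapping('/[0-9][0-9]/', 'A_DATED_FILE_091410.txt');
print_r($pairs);
```

Stepping from the match start rather than the match end is what lets the next search see the second digit of the previous pair again.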
I've got another solution that will get five matches without having to use offsets. I'm adding it here just for curiosity, and I probably wouldn't use it myself in production code (it's a somewhat complex regex too). You can use a regex with a lookbehind that looks for a digit before the current position and captures that digit inside the lookbehind (a lookaround by itself does not capture anything, but a capturing group placed inside one does):
(?<=([0-9]))[0-9]
Let's walk through this regex:
(?<= # open a positive lookbehind
( # open a capturing group
[0-9] # match 0-9
) # close the capturing group
) # close the lookbehind
[0-9] # match 0-9
Because lookarounds are zero-width and do not move the regex position, this regular expression will match 5 times: the engine will advance until the 9 (because that is the first position which satisfies the lookbehind assertion). Since 9 matches [0-9], the engine will take 9 as a match (but because we're capturing in the lookaround, it'll also capture the 0!). The engine then moves to the 1. Again, the lookbehind succeeds (this time capturing the 9 as the 1st subgroup), and the 1 is taken as the match (and so on, until the engine hits the end of the string).
When we give this pattern to preg_match_all, we'll end up with an array that looks like (using the PREG_SET_ORDER flag to group capturing groups along with the full match):
Array
(
[0] => Array
(
[0] => 9
[1] => 0
)
[1] => Array
(
[0] => 1
[1] => 9
)
[2] => Array
(
[0] => 4
[1] => 1
)
[3] => Array
(
[0] => 1
[1] => 4
)
[4] => Array
(
[0] => 0
[1] => 1
)
)
Note that each "match" has its digits out of order! This is because the capture group in the lookbehind becomes backreference 1 while the whole match is backreference 0. We can put it back together in the correct order though:
preg_match_all('/(?<=([0-9]))[0-9]/', $fileName, $matches, PREG_SET_ORDER);
$allSequences = array();
foreach ($matches as $match)
{
    $allSequences[] = $match[1] . $match[0];
}
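A related variant, which is a different technique from the lookbehind above, is to put the whole two-digit pattern inside a lookahead with a capturing group. The sketch below assumes you only care about group 1, since the full match (group 0) is an empty string at every position.

```php
<?php
$fileName = 'A_DATED_FILE_091410.txt';

// The lookahead consumes nothing, so the engine retries at every position;
// group 1 holds the two digits seen ahead of that position.
preg_match_all('/(?=([0-9]{2}))/', $fileName, $matches);

$allSequences = $matches[1];
print_r($allSequences);
```

With this form the digits come out already in order, so there is nothing to stitch back together afterwards.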
The search for the next match starts at the first character after the previous match. So when 09 is matched in 091410, the search for the next match starts at 1410.