
Why does this regex have 3 matches, not 5?

Tags: regex, php

I wrote a pretty simple preg_match_all call in PHP:

$fileName = 'A_DATED_FILE_091410.txt';
$matches = array();
preg_match_all('/[0-9][0-9]/',$fileName,$matches);
print_r($matches);

My Expected Output:

$matches = array(
    [0] => array(
        [0] => 09,
        [1] => 91,
        [2] => 14,
        [3] => 41,
        [4] => 10
    )
)

What I got instead:

$matches = array(
    [0] => array(
        [0] => 09,
        [1] => 14,
        [2] => 10
    )
)

Now, in this particular use case this was preferable, but I'm wondering why it didn't match the other substrings? Also, is a regex possible that would give me my expected output, and if so, what is it?

GSto asked Dec 21 '22 23:12


2 Answers

With a global regex (which is what preg_match_all uses), once a match is made, the regex engine continues searching the string from the end of the previous match.

In your case, the engine starts at the beginning of the string and advances until it reaches the 0, the first character that matches [0-9]. The next character (9) matches the second [0-9], so 09 is taken as a match. Because the end of the string has not been reached, the engine resumes at the position just after that match (the first 1) and repeats the process. The 9 has already been consumed, so 91 is never tried, and the remaining matches are 14 and 10.
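
To see where each search resumes, here is a minimal sketch that prints the offsets reported by PREG_OFFSET_CAPTURE for the pattern from the question. The variable names $texts, $starts and $next are only illustrative, and the offsets are byte offsets into the string.

```php
<?php
$fileName = 'A_DATED_FILE_091410.txt';
$matches = array();
preg_match_all('/[0-9][0-9]/', $fileName, $matches, PREG_OFFSET_CAPTURE);

$texts = array();
$starts = array();
foreach ($matches[0] as $match) {
    list($text, $start) = $match;   // matched text and its byte offset
    $texts[] = $text;
    $starts[] = $start;
    $next = $start + strlen($text); // where the next search begins
    echo "'$text' found at offset $start; search resumes at offset $next\n";
}
```

The three matches start at offsets 13, 15 and 17. Offsets 14 and 16, where 91 and 41 would begin, are skipped because they lie inside a match that was already consumed.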

See also: First Look at How a Regex Engine Works Internally


If you need every two-digit sequence, including overlapping ones, you can call preg_match in a loop and pass an offset that tells it where to start searching:

$fileName = 'A_DATED_FILE_091410.txt';
$allSequences = array();
$matches = array();
$offset = 0;

while (preg_match('/[0-9][0-9]/', $fileName, $matches, PREG_OFFSET_CAPTURE, $offset))
{
  list($match, $offset) = $matches[0];
  $allSequences[] = $match;
  $offset++; // the match is 2 digits long, so resume the search at its second digit
}

Note that the offset returned with the PREG_OFFSET_CAPTURE flag is the start of the match.
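
The same loop can be wrapped in a small helper. This is only a sketch: overlappingMatches is a hypothetical name, not a PHP built-in, and it assumes a single-byte (ASCII) subject, since stepping one byte at a time could land in the middle of a multi-byte character when the pattern uses the /u modifier.

```php
<?php
// Hypothetical helper (not part of PHP): collect every match of $pattern,
// including overlapping ones, by restarting one byte after each match start.
function overlappingMatches($pattern, $subject)
{
    $found = array();
    $offset = 0;
    $m = array();
    // preg_match returns 1 on a match, 0 on no match, false on error
    while (preg_match($pattern, $subject, $m, PREG_OFFSET_CAPTURE, $offset) === 1) {
        list($text, $start) = $m[0]; // $start is where this match begins
        $found[] = $text;
        $offset = $start + 1;        // resume just past the start of this match
    }
    return $found;
}

$result = overlappingMatches('/[0-9][0-9]/', 'A_DATED_FILE_091410.txt');
print_r($result);
```

For the file name from the question, $result holds 09, 91, 14, 41 and 10, in that order.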


Here is another solution that gets five matches without using offsets. I'm adding it mostly out of curiosity and probably wouldn't use it in production code, since the regex is somewhat complex. It uses a lookbehind to check for a digit before the current position and captures that digit inside the lookbehind (lookarounds themselves do not capture, but a capturing group placed inside one does):

(?<=([0-9]))[0-9]

Let's walk through this regex:

(?<=       # open a positive lookbehind
  (        # open a capturing group
    [0-9]  # match 0-9
  )        # close the capturing group
)          # close the lookbehind
[0-9]      # match 0-9

Because lookarounds are zero-width and do not move the regex position, this regular expression will match 5 times. The engine advances until the 9, the first position that satisfies the lookbehind assertion. Since 9 matches [0-9], the engine takes 9 as the match, and because we're capturing inside the lookbehind, it also captures the 0 as group 1. The engine then moves to the 1. Again the lookbehind succeeds and captures the 9, and the 1 becomes the full match. This continues until the engine reaches the end of the string.

When we give this pattern to preg_match_all, we end up with an array that looks like this (using the PREG_SET_ORDER flag to group each capturing group with its full match):

Array
(
    [0] => Array
        (
            [0] => 9
            [1] => 0
        )

    [1] => Array
        (
            [0] => 1
            [1] => 9
        )

    [2] => Array
        (
            [0] => 4
            [1] => 1
        )

    [3] => Array
        (
            [0] => 1
            [1] => 4
        )

    [4] => Array
        (
            [0] => 0
            [1] => 1
        )

)

Note that each "match" has its digits out of order! This is because the capture group in the lookbehind becomes group 1 (the first digit) while the whole match is group 0 (the second digit). We can put them back together in the correct order, though:

preg_match_all('/(?<=([0-9]))[0-9]/', $fileName, $matches, PREG_SET_ORDER);
$allSequences = array();
foreach ($matches as $match)
{
  $allSequences[] = $match[1] . $match[0];
}
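
A related variant, which is not from the answer above and is offered only as a sketch: put the whole two-digit capture inside a lookahead instead. The overall match is then empty, so the engine moves forward one character at a time, and group 1 already holds the digits in the right order. The variable name $pairs is illustrative.

```php
<?php
$fileName = 'A_DATED_FILE_091410.txt';
$matches = array();

// (?=...) is a zero-width lookahead; the capturing group inside it
// records the two digits without consuming them.
preg_match_all('/(?=([0-9][0-9]))/', $fileName, $matches);

$pairs = $matches[1]; // group 1 of every match; $matches[0] holds empty strings
print_r($pairs);
```

Here $pairs contains 09, 91, 14, 41 and 10, so no reassembly step is needed.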

Daniel Vandersluis answered Jan 02 '23 17:01


The search for the next match starts at the first character after the previous match. So when 09 is matched in 091410, the search for the next match starts at 1410.

Gumbo answered Jan 02 '23 16:01