
Matching optional string in regex

Tags: regex, php

I have a problem matching optional pattern groups in regex. The metacharacters * and + are greedy, so I assumed the metacharacter ? would be greedy too, but it doesn't behave the way I expected.

I assumed that when a pattern group is made optional, it would be returned in the match results whenever it is found in the string, and that when it isn't found we would still get an overall match, just without that group in the results.

What actually happens is that even when the optional pattern is present in the string, it isn't included in the match results. The regex engine seems to notice that the pattern group is optional and doesn't even attempt to match it.

If we set up a test and change this optional pattern group to a mandatory one, the regex includes it in the match results. That is only practical for the test, though, because sometimes this pattern won't be present in the string.

The reason I need the match included in the results is that I need to analyze the match results at a later date.

In case I have not described this scenario very well, I have set up a very simple example in PHP, which follows.

$string = 'This is a test, Stackoverflow. 2014 Cecili0n';

if(preg_match_all("~(This).*?(Stackoverflow)?~i",$string,$match))
    print_r($match);

Results

Array
(
    [0] => Array
        (
            [0] => This
        )

    [1] => Array
        (
            [0] => This
        )

    [2] => Array
        (
            [0] => 
        )
)

(Stackoverflow)? is the optional pattern. If we run the above code, it is not returned in the match results, even though this pattern is present in the string.

If we make this pattern group mandatory it will be returned in the results, like in the following.

if(preg_match_all("~(This).*?(Stackoverflow)~i",$string,$match))
    print_r($match);

Results

Array
(
    [0] => Array
        (
            [0] => This is a test, Stackoverflow
        )

    [1] => Array
        (
            [0] => This
        )

    [2] => Array
        (
            [0] => Stackoverflow
        )
)

How can I achieve this? It is important for me to get accurate data on how the match was found.

Thanks for any thoughts on the matter.

cecilli0n asked Mar 10 '14 14:03



1 Answer

What happens here

This might be surprising, but it is actually expected behavior. Let's break down the regex and translate it to human-readable terms:

(This)               Match "This" literally
.*?                  Match any character **as few times as possible**,
                     while still allowing the rest of the expression to match
(Stackoverflow)?     Match "Stackoverflow" literally **if possible**

So what happens is:

  • The regex engine matches "This".
  • It then has to consider how many characters the *? quantifier should match.
  • Let's assume we match zero characters. Does this allow the rest of the expression to match? In other words, does (Stackoverflow)? match " is a test, Stackoverflow. 2014 Cecili0n"?
  • The subpattern is optional, so it does! Therefore, .*? matches zero characters.
  • What does the final subpattern (Stackoverflow)? match? Obviously nothing at the position where the match is attempted.

End result: both quantified subpatterns match the empty string.
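
The walkthrough above can be checked with a short sketch. It assumes the question's sample string and adds an extra capture group around .*? (not part of the original pattern) purely to expose what the lazy quantifier consumed. The second pattern is an illustrative assumption, included to show that ? is in fact greedy when its group can match at the current position.

```php
<?php
$string = 'This is a test, Stackoverflow. 2014 Cecili0n';

// Same pattern as the question, with an extra group around .*? (added here
// for illustration only) so that its match becomes visible.
preg_match('~(This)(.*?)(Stackoverflow)?~i', $string, $m);

$whole    = $m[0];        // "This": the overall match stops right after group 1
$lazy     = $m[2];        // "": the lazy quantifier consumed nothing
$optional = $m[3] ?? '';  // "": group 3 never took part, so preg_match omits it

// The ? quantifier is greedy, but it is only tried at the current position.
// Here " is" directly follows "This", so the optional group is taken.
preg_match('~(This)( is)?~i', $string, $g);
$adjacent = $g[2];        // " is"

var_dump($whole, $lazy, $optional, $adjacent);
```

In other words, the optional group is skipped because the lazy .*? in front of it never advances to the place where "Stackoverflow" starts, not because ? is lazy.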

How to get the expected result

If making everything optional won't work, how do you optionally match "Stackoverflow"? By explicitly spelling out the acceptable options to the regex engine:

~(This)(.*?(Stackoverflow)|.*?)~i

This instructs the engine to either match as few characters as it can followed by the literal "Stackoverflow", or otherwise match as few characters as it can on their own. Alternatives are tried from left to right, so by listing the "Stackoverflow included" option first you are assured that if it does exist later in the text it will be matched.

Obviously the standalone .*? alternative does not make much sense in this example, since on its own it always matches the empty string, but I am leaving it as it is because I want to describe a "mechanical" transformation that will work regardless of the actual regular expression.
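
As a rough sketch of what this rewritten pattern returns, the snippet below runs it against the question's sample string with preg_match. The variable names are illustrative assumptions. The group indices are the point of interest: the added structural group takes index 2, and "Stackoverflow" moves to index 3.

```php
<?php
$string = 'This is a test, Stackoverflow. 2014 Cecili0n';

// First alternative: lazy .*? followed by the keyword; second: lazy .*? alone.
preg_match('~(This)(.*?(Stackoverflow)|.*?)~i', $string, $m);

$whole   = $m[0]; // "This is a test, Stackoverflow"
$wrapper = $m[2]; // " is a test, Stackoverflow" (the structural group)
$keyword = $m[3]; // "Stackoverflow" (now group 3 instead of group 2)

print_r($m);
```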

Note that to maintain full equivalence with the original regex the extra group introduced for structural purposes has to be made non-capturing:

~(This)(?:.*?(Stackoverflow)|.*?)~i
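
A minimal sketch of the non-capturing form in use, written here with the lazy .*? fallback and run through preg_match_all as in the question. The second subject string is a made-up example without the keyword, and the array keys are illustrative names, not part of the original post.

```php
<?php
$pattern = '~(This)(?:.*?(Stackoverflow)|.*?)~i';

$subjects = [
    'with'    => 'This is a test, Stackoverflow. 2014 Cecili0n',
    // Hypothetical input that lacks the optional keyword.
    'without' => 'This is a test, nothing else. 2014 Cecili0n',
];

$found = [];
foreach ($subjects as $label => $subject) {
    preg_match_all($pattern, $subject, $match);
    // Group 2 is the keyword again; preg_match_all reports it as an
    // empty string when the second alternative was taken.
    $found[$label] = $match[2][0];
}

print_r($found); // 'with' => "Stackoverflow", 'without' => ""
```

Because the structural group is non-capturing, the result array keeps the same layout as in the question: index 1 holds "This" and index 2 holds the optional keyword or an empty string.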


Jon answered Nov 04 '22 11:11
