
Java Regular Expression Two Question marks (??)

Tags: regex

I know that /? means the / is optional, so in the same way "toys?" will match both toy and toys. My understanding is that if I make it lazy and use "toys??", it will still match both toy and toys but always return toy. So, a quick test:

package regextest;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegExTest {
    private final static Pattern TEST_PATTERN = Pattern.compile("toys??", Pattern.CASE_INSENSITIVE);

    public static void main(String[] args) {
        for (String arg : args) {
            Matcher m = TEST_PATTERN.matcher(arg);
            System.out.print("Arg: " + arg);
            // Print every match: group 0 is the whole match, followed by any capturing groups.
            while (m.find()) {
                System.out.print(" {");
                for (int i = 0; i <= m.groupCount(); ++i) {
                    System.out.print("[" + m.group(i) + "]");
                }
                System.out.print("}");
            }
            System.out.println();
        }
    }
}

Yep, it looks like it works as expected:

java -cp .. regextest.RegExTest toy toys
Arg: toy {[toy]}
Arg: toys {[toy]}

Now, change the regular expression to "toys??2" and it still matches toys2 and toy2. In both cases it returns the entire string, with the s left in where it is present. Is there any functional difference between searching for "toys?2" and "toys??2"?

The reason I am asking is because I found an example like the following:

private final static Pattern TEST_PATTERN = Pattern.compile("</??tag(\\s+?.*?)??>", Pattern.CASE_INSENSITIVE);

Although I see no apparent reason for using ?? rather than ?, I thought that perhaps the original author (who is not known to me) might know something that I don't. I expect the latter.

Andrew asked Jan 22 '14 16:01




1 Answer

?? is lazy while ? is greedy.

Given (pattern)??, the engine will first try the empty string; only if the rest of the regex can't match from there will it backtrack and try pattern.

In contrast, (pattern)? will test for pattern first, then it will test for empty string on backtrack.
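The order of attempts becomes visible when nothing follows the optional part. The sketch below is only an illustration: the class name LazyVsGreedy and the helper firstMatch are made up for this example, and it reuses the "toys" strings from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LazyVsGreedy {
    // Hypothetical helper: returns the text of the first match, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Greedy: tries to consume the s first, so the match includes it.
        System.out.println(firstMatch("toys?", "toys"));   // toys
        // Lazy: tries to skip the s first, and nothing after it forces a retry.
        System.out.println(firstMatch("toys??", "toys"));  // toy
    }
}
```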


Now, change the regular expression to "toys??2" and it still matches toys2 and toy2. In both cases it returns the entire string, with the s left in where it is present. Is there any functional difference between searching for "toys?2" and "toys??2"?

The difference is in the order of searching:

  • "toys?2" searches for toys2, then toy2
  • "toys??2" searches for toy2, then toys2

But for these two patterns the result will be the same regardless of the input string, since the 2 that follows s? or s?? must be matched either way.
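A quick way to check this is to run both patterns over a few inputs and compare what they return. This is a minimal sketch; the class name TrailingLiteral, the helper firstMatch, and the sample inputs are assumptions chosen for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TrailingLiteral {
    // Hypothetical helper: returns the text of the first match, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String[] inputs = {"toy2", "toys2", "toys", "my toys2 box"};
        for (String input : inputs) {
            // The mandatory 2 forces both patterns to settle on the same match.
            String greedy = firstMatch("toys?2", input);
            String lazy = firstMatch("toys??2", input);
            System.out.println(input + " -> greedy=" + greedy + ", lazy=" + lazy);
        }
    }
}
```

Only the order in which the engine tries the two alternatives differs, so any difference would show up as speed, not as a different match.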


As for the pattern you found:

Pattern.compile("</??tag(\\s+?.*?)??>", Pattern.CASE_INSENSITIVE)

Both ?? can be changed to ? without affecting the result:

  • / and t (in tag) are mutually exclusive. You either match one or the other.
  • > and \s are also mutually exclusive. The requirement of at least one whitespace character in \s+? is important to this conclusion: the result might be different otherwise.
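The equivalence can be spot-checked by running the original pattern and a variant with both ?? replaced by ? over the same inputs. The sketch below is not a proof; the class name TagPatternCompare, the helper describe, and the sample inputs are assumptions made for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagPatternCompare {
    // The pattern from the question, and the same pattern with ?? replaced by ?.
    static final Pattern LAZY = Pattern.compile("</??tag(\\s+?.*?)??>", Pattern.CASE_INSENSITIVE);
    static final Pattern GREEDY = Pattern.compile("</?tag(\\s+?.*?)?>", Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: summarises the first match and its capturing group.
    static String describe(Pattern p, String input) {
        Matcher m = p.matcher(input);
        return m.find() ? m.group() + " | group 1: " + m.group(1) : "no match";
    }

    public static void main(String[] args) {
        String[] inputs = {"<tag>", "</tag>", "<tag a=\"1\">", "<tag  >", "</TAG x>", "<tags>"};
        for (String input : inputs) {
            String lazy = describe(LAZY, input);
            String greedy = describe(GREEDY, input);
            System.out.println(input + " -> " + lazy + " (same as greedy: " + lazy.equals(greedy) + ")");
        }
    }
}
```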

This is probably a micro-optimization by the author. He probably thinks that the open tag must be there while the closing tag might be forgotten, and that open/close tags without attributes or stray spaces appear more often than those with some.

By the way, the engine might run into expensive backtracking due to \\s+?.*? when the input has <tag followed by lots of spaces without a > anywhere near.
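To get a rough feel for that cost, the following sketch times the pattern against "<tag" followed by a growing run of spaces and no closing >. The class name BacktrackSketch, its helpers, and the input sizes are assumptions for illustration, and the timings depend on the JVM and machine, so treat them as indicative only.

```java
import java.util.regex.Pattern;

class BacktrackSketch {
    static final Pattern TAG = Pattern.compile("</??tag(\\s+?.*?)??>", Pattern.CASE_INSENSITIVE);

    // Builds "<tag" followed by the given number of spaces and no closing '>'.
    static String unterminated(int spaces) {
        StringBuilder sb = new StringBuilder("<tag");
        for (int i = 0; i < spaces; i++) {
            sb.append(' ');
        }
        return sb.toString();
    }

    // The search must fail, but only after the engine has tried the ways to
    // split the spaces between \s+? and .*? before giving up.
    static boolean findsTag(int spaces) {
        return TAG.matcher(unterminated(spaces)).find();
    }

    public static void main(String[] args) {
        for (int n : new int[] {500, 1000, 2000}) {
            long start = System.nanoTime();
            boolean found = findsTag(n);
            long micros = (System.nanoTime() - start) / 1000;
            System.out.println(n + " spaces: found=" + found + ", " + micros + " us");
        }
    }
}
```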

nhahtdh answered Oct 22 '22 02:10