
From wildcards to regular expressions

I want to allow the two main wildcards ? and * to filter my data.

Here is how I'm doing it now (as I've seen on many websites):

public boolean contains(String data, String filter) {
    if(data == null || data.isEmpty()) {
        return false;
    }
    String regex = filter.replace(".", "[.]")
                         .replace("?", ".")
                         .replace("*", ".*");
    return Pattern.matches(regex, data);
}

But shouldn't we escape all the other regex special chars, like | or (, etc.? And also, maybe we could preserve ? and * if they are preceded by a \? For example, something like:

filter.replaceAll("([$|\\[\\]{}(),.+^-])", "\\\\$1") // 1. escape regex special chars, but ?, * and \
      .replaceAll("([^\\\\]|^)\\?", "$1.")           // 2. replace any ? that isn't preceded by a \ by .
      .replaceAll("([^\\\\]|^)\\*", "$1.*")          // 3. replace any * that isn't preceded by a \ by .*
      .replaceAll("\\\\([^?*]|$)", "\\\\\\\\$1");    // 4. replace any \ that isn't followed by a ? or a * (possibly due to step 2 and 3) by \\

What do you think about it? If you agree, am I missing any other regex special char?
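One way to sidestep the question of which characters to escape is to avoid a hand-written escape list altogether. The following is only a sketch of that alternative, not the approach from the snippets above: it walks the filter character by character and quotes every literal run with Pattern.quote. The class name WildcardFilter, the helper names, and the rule that a backslash makes the next character literal are assumptions made for illustration.

```java
import java.util.regex.Pattern;

final class WildcardFilter {

    // Translates a wildcard filter into a regex string.
    // Assumed rules: '?' becomes '.', '*' becomes '.*', a backslash makes the
    // next character literal, and all other text is quoted as-is.
    static String toRegex(String filter) {
        StringBuilder regex = new StringBuilder();
        StringBuilder literal = new StringBuilder();
        for (int i = 0; i < filter.length(); i++) {
            char c = filter.charAt(i);
            if (c == '\\' && i + 1 < filter.length()) {
                literal.append(filter.charAt(++i)); // escaped char is taken literally
            } else if (c == '?' || c == '*') {
                flush(literal, regex);
                regex.append(c == '?' ? "." : ".*");
            } else {
                literal.append(c); // also covers a lone trailing backslash
            }
        }
        flush(literal, regex);
        return regex.toString();
    }

    // Appends the pending literal text, quoted, and clears the buffer.
    private static void flush(StringBuilder literal, StringBuilder regex) {
        if (literal.length() > 0) {
            regex.append(Pattern.quote(literal.toString()));
            literal.setLength(0);
        }
    }

    static boolean contains(String data, String filter) {
        if (data == null || data.isEmpty()) {
            return false;
        }
        return Pattern.matches(toRegex(filter), data);
    }

    public static void main(String[] args) {
        System.out.println(contains("th|is", "th|i?")); // true: '|' is literal
        System.out.println(contains("a*b", "a\\*b"));   // true: escaped '*' is literal
        System.out.println(contains("axb", "a\\*b"));   // false
    }
}
```

Because every literal run goes through Pattern.quote, no regex metacharacter can leak through, whichever ones exist. Note that '.' does not match line terminators unless Pattern.DOTALL is used, which may or may not matter for the data being filtered.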


Edit #1 (after taking dan1111's and m.buettner's advice into account):

// replace any even number of backslashes by a *
regex = regex.replaceAll("(?<!\\\\)(\\\\\\\\)+(?!\\\\)", "*");
// reduce redundant wildcards that aren't preceded by a \
regex = regex.replaceAll("(?<!\\\\)[?]*[*][*?]+", "*");
// escape regexps special chars, but \, ? and *
regex = regex.replaceAll("([|\\[\\]{}(),.^$+-])", "\\\\$1");
// replace ? that aren't preceded by a \ by .
regex = regex.replaceAll("(?<!\\\\)[?]", ".");
// replace * that aren't preceded by a \ by .*
regex = regex.replaceAll("(?<!\\\\)[*]", ".*");

What about this one?


Edit #2 (after taking dan1111's advice into account):

// replace any even number of backslashes by a *
regex = regex.replaceAll("(?<!\\\\)(\\\\\\\\)+(?!\\\\)", "*");
// reduce redundant wildcards that aren't preceded by a \
regex = regex.replaceAll("(?<!\\\\)[?]*[*][*?]+", "*");
// escape regexps special chars (if not already escaped by user), but \, ? and *
regex = regex.replaceAll("(?<!\\\\)([|\\[\\]{}(),.^$+-])", "\\\\$1");
// replace ? that aren't preceded by a \ by .
regex = regex.replaceAll("(?<!\\\\)[?]", ".");
// replace * that aren't preceded by a \ by .*
regex = regex.replaceAll("(?<!\\\\)[*]", ".*");

Goal in sight?

sp00m asked Nov 04 '22 08:11


1 Answer

In Java, the replacement string of replaceAll treats the backslash as an escape character of its own, so the four backslashes in your source code ("\\\\$1") are what it takes to write out a single literal backslash followed by the captured group. With only two ("\\$1"), the $ is escaped and the literal text $1 is inserted instead.

You can, however, avoid the ([^\\\\]|^) and the $1 in the replacement string of steps 2 and 3 by using a negative lookbehind:

filter.replaceAll("([$|\\[\\]{}(),.+^-])", "\\\\$1") // 1. escape regex special chars, except ?, * and \
      .replaceAll("(?<!\\\\)[?]", ".")               // 2. replace any ? that isn't preceded by a \ with .
      .replaceAll("(?<!\\\\)[*]", ".*");             // 3. replace any * that isn't preceded by a \ with .*

I don't really see what you need the last step for. Wouldn't it escape the backslashes that escape your meta-characters, and so in turn leave those meta-characters unescaped? Say your original input was th|is. Your first replacement would turn that into th\|is. The last replacement would then turn it into th\\|is, which matches either th followed by a backslash, or is.

You need to differentiate between how your string looks written in code (uncompiled, with twice as many backslashes) and how it looks after it was compiled (containing only half the amount of backslashes).
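The layers can be checked directly. The sketch below is an illustration with made-up names (BackslashLayers, withFour, withTwo) and assumes Java's String.replaceAll, whose replacement string adds one more layer of backslash handling on top of the source-literal layer.

```java
final class BackslashLayers {

    // Source "\\\\$1" is the string \\$1, which replaceAll reads as:
    // one literal backslash, followed by capture group 1.
    static String withFour(String input) {
        return input.replaceAll("([|])", "\\\\$1");
    }

    // Source "\\$1" is the string \$1, which replaceAll reads as:
    // an escaped (literal) dollar sign, followed by the character 1.
    static String withTwo(String input) {
        return input.replaceAll("([|])", "\\$1");
    }

    public static void main(String[] args) {
        System.out.println(withFour("a|b")); // a\|b
        System.out.println(withTwo("a|b"));  // a$1b
    }
}
```

When the replacement text should be taken completely literally, Matcher.quoteReplacement can build the escaped form instead of counting backslashes by hand.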

You might also want to think about restricting the number of * allowed in a filter. A regex like .*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*! (where ! cannot be found in the input) can take quite a while to run, because the engine tries every way of splitting the input among the .* parts before giving up. The issue is called catastrophic backtracking.
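A cheap first mitigation is to collapse runs of adjacent * before building the regex, since ** matches exactly what * matches. The following is a minimal sketch under the assumption that a backslash escapes the next character; the names StarCollapser and collapseStars are hypothetical. It does not cap the total number of * in a filter, which would need a separate check.

```java
final class StarCollapser {

    // Collapses each run of consecutive unescaped '*' into a single '*'.
    // An escaped pair such as \* is copied through untouched.
    static String collapseStars(String filter) {
        StringBuilder out = new StringBuilder();
        boolean previousWasStar = false;
        for (int i = 0; i < filter.length(); i++) {
            char c = filter.charAt(i);
            if (c == '\\' && i + 1 < filter.length()) {
                out.append(c).append(filter.charAt(++i)); // keep the escaped pair
                previousWasStar = false;
            } else if (c == '*') {
                if (!previousWasStar) {
                    out.append(c); // first star of a run
                }
                previousWasStar = true;
            } else {
                out.append(c);
                previousWasStar = false;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(collapseStars("a****b")); // a*b
    }
}
```

A filter such as *a*a*a*a* is left unchanged by this step, so limiting the number of wildcards per filter may still be worth considering.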

Martin Ender answered Nov 12 '22 13:11