Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why is performance of ^(?:x+y){5}$ slower than ^x+yx+yx+yx+yx+y$

I let the following compiled regular expressions match on a bunch of strings, both in .net (N) and Java (J). Through multiple runs there are consistent differences between regex 1 and regex 2, both in .net and in Java:

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
  #  regex                             N secs   N x  J secs   J x 
──────────────────────────────────────────────────────────────────
  1  ^[^@]+@[^@]+@[^@]+@[^@]+@[^@]+@$    8.10     1    5.67     1 
  2  ^(?:[^@]+@){5}$                    11.07  1.37    6.48  1.14 
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Could and should regex compilers not unroll and otherwise normalize equivalent constructs into the form that performs best?

If they "could and should", it whould at least be possible to write a regex optimizer, which modifies the regex string before it gets compiled.

The crucial bits of the code used:

.net

// init
regex = new Regex(r, RegexOptions.Compiled | RegexOptions.CultureInvariant);

// test
sw = Stopwatch.Start();
foreach (var s in strs)
  if (regex.isMatch(s))
    matches++;
elapsed = sw.Elapsed;

Java

// init
pat = Pattern.compile(r);

// test
before = System.currentTimeMillis();
for (String s : strs)
  if (pat.matcher(s).matches())
    matches++;
elapsed = System.currentTimeMillis() - before;
like image 659
Evgeniy Berezovsky Avatar asked Jan 10 '23 00:01

Evgeniy Berezovsky


2 Answers

I don't know about .NET, since I haven't studied its source code in detail.

However, in Java, particularly Oracle/Sun implementation, I can say that it is likely due to the the looping structure overhead.

In this answer, whenever I refer to the implementation of regex in Java, it is Oracle/Sun implementation that I'm referring to. I haven't studied other implementation yet, so I can't really say anything.

Greedy quantifier

I just realize that this part has little to do with the question. However, it gives some introduction to how greedy quantifier is implemented this way, so I'll just leave it here.

Given an atom A with a greedy quantifier A* (the number of repetition doesn't really matter here), the greedy quantifier will try to match as many A as possible, then try the sequel (whatever comes after A*), on failure, backtrack one repetition at a time and retry with the sequel.

The problem is where to backtrack to. It is extremely inefficient to rematch the whole thing just to figure out the position, so we need to store the position where a repetition finishes its match at, for every single repetition. The more you repeat, the more memory you need to keep all the states so far for backtracking, not mentioning the capturing groups (if any).

Unrolling the regex as done in the question doesn't escape the memory requirement above.

Simple cases with fixed-length atom

However, for simple cases such as [^@]* where you know the atom A (in this case [^@]) can only match fixed length string (length 1), only the last match position and the length of the match are needed to perform the match efficiently. Java's implementation includes a study method to detect fixed length patterns like these to compile to a loop implementation (Pattern.Curly and Pattern.GroupCurly).

This is how the first regex ^[^@]+@[^@]+@[^@]+@[^@]+@[^@]+@$ looks like after it is compiled into a chain of nodes inside Pattern class:

Begin. \A or default ^
Curly. Greedy quantifier {1,2147483647}
  CharProperty.complement. S̄:
    BitClass. Match any of these 1 character(s):
      @
  Node. Accept match
Single. Match code point: U+0040 COMMERCIAL AT
Curly. Greedy quantifier {1,2147483647}
  CharProperty.complement. S̄:
    BitClass. Match any of these 1 character(s):
      @
  Node. Accept match
Single. Match code point: U+0040 COMMERCIAL AT
Curly. Greedy quantifier {1,2147483647}
  CharProperty.complement. S̄:
    BitClass. Match any of these 1 character(s):
      @
  Node. Accept match
Single. Match code point: U+0040 COMMERCIAL AT
Curly. Greedy quantifier {1,2147483647}
  CharProperty.complement. S̄:
    BitClass. Match any of these 1 character(s):
      @
  Node. Accept match
Single. Match code point: U+0040 COMMERCIAL AT
Curly. Greedy quantifier {1,2147483647}
  CharProperty.complement. S̄:
    BitClass. Match any of these 1 character(s):
      @
  Node. Accept match
Single. Match code point: U+0040 COMMERCIAL AT
Dollar(multiline=false). \Z or default $
java.util.regex.Pattern$LastNode
Node. Accept match

Looping structure overhead

In cases where the length is not fixed, like the atom (?:[^@]+@) in the regex ^(?:[^@]+@){5}$ in the question, Java's implementation switches over to recursion to handle the matching (Pattern.Loop).

Begin. \A or default ^
Prolog. Loop wrapper
Loop[1733fe5d]. Greedy quantifier {5,5}
  java.util.regex.Pattern$GroupHead
  Curly. Greedy quantifier {1,2147483647}
    CharProperty.complement. S̄:
      BitClass. Match any of these 1 character(s):
        @
    Node. Accept match
  Single. Match code point: U+0040 COMMERCIAL AT
  GroupTail. --[next]--> Loop[1733fe5d]
Dollar(multiline=false). \Z or default $
java.util.regex.Pattern$LastNode
Node. Accept match

This incurs extra overhead on every repetition to go through the nodes GroupTail --> Loop --> GroupHead.

You may ask why the implementation does this even though the number of repetitions is fixed. Since the number of repetition is fixed, by right, there is nowhere to backtrack at all, so can't we just keep track of the state before the repetition and the state in the current repetition?

Well, here is a counterexample: ^(?:a{1,5}?){5}$ on a string length 15 of only a's. The backtracking can still happen inside the atom, so we need to store the match position of every repetition as per usual.

Actual timing

What I have discussed above are all at the Java source code (and thus bytecode) level. While the source code may reveal certain problem in the implementation, the performance ultimately depends on how the JVM generates the machine code and performs optimization.

This is the source code I used for testing:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;
import java.util.regex.Pattern;


public class SO28161874 {
    // Makes sure the same set of strings is generated between different runs
    private static Random r = new Random();

    public static void main(String args[]) {
        final int rep = 5;

        // String r1 = "^[^@]+@[^@]+@[^@]+@[^@]+@[^@]+@$";
        String r1 = genUnroll(rep);
        // String r2 = "^(?:[^@]+@){5}$";
        String r2 = genQuantifier(rep);

        List<String> strs = new ArrayList<String>();
        // strs.addAll(generateRandomString(500, 40000, 0.002, false));
        // strs.addAll(generateRandomString(500, 40000, 0.01, false));
        // strs.addAll(generateRandomString(500, 40000, 0.01, true));
        // strs.addAll(generateRandomString(500, 20000, 0, false));
        // strs.addAll(generateRandomString(500, 40000, 0.002, true));
        strs.addAll(generateNearMatchingString(500, 40000, rep));

        /*
        // Assertion for generateNearMatchingString
        for (String s: strs) {
            assert(s.matches(r1.replaceAll("[+]", "*")));
        }
        */

        System.out.println("Test string generated");

        System.out.println(r1);
        System.out.println(test(Pattern.compile(r1), strs));

        System.out.println(r2);
        System.out.println(test(Pattern.compile(r2), strs));
    }

    private static String genUnroll(int rep) {
        StringBuilder out = new StringBuilder("^");

        for (int i = 0; i < rep; i++) {
            out.append("[^@]+@");
        }

        out.append("$");
        return out.toString();
    }

    private static String genQuantifier(int rep) {
        return "^(?:[^@]+@){" + rep + "}$";
    }

    /*
     * count -- number of strings
     * maxLength -- maximum length of the strings
     * chance -- chance that @ will appear in the string, from 0 to 1
     * end -- the string appended with @
     */
    private static List<String> generateRandomString(int count, int maxLength, double chance, boolean end) {
        List<String> out = new ArrayList<String>();

        for (int i = 0; i < count; i++) {
            StringBuilder sb = new StringBuilder();
            int length = r.nextInt(maxLength);
            for (int j = 0; j < length; j++) {
                if (r.nextDouble() < chance) {
                    sb.append('@');
                } else {
                    char c = (char) (r.nextInt(96) + 32);
                    if (c != '@') {
                        sb.append(c);
                    } else {
                        j--;
                    }
                }
            }

            if (end) {
                sb.append('@');
            }

            out.add(sb.toString());

        }

        return out;
    }

    /*
     * count -- number of strings
     * maxLength -- maximum length of the strings
     * rep -- number of repetitions of @
     */
    private static List<String> generateNearMatchingString(int count, int maxLength, int rep) {
        List<String> out = new ArrayList<String>();

        int pos[] = new int[rep - 1]; // Last @ is at the end

        for (int i = 0; i < count; i++) {
            StringBuilder sb = new StringBuilder();
            int length = r.nextInt(maxLength);

            for (int j = 0; j < pos.length; j++) {
                pos[j] = r.nextInt(length);
            }
            Arrays.sort(pos);

            int p = 0;

            for (int j = 0; j < length - 1; j++) {
                if (p < pos.length && pos[p] == j) {
                    sb.append('@');
                    p++;
                } else {
                    char c = (char) (r.nextInt(95) + 0x20);
                    if (c != '@') {
                        sb.append(c);
                    } else {
                        j--;
                    }
                }
            }

            sb.append('@');

            out.add(sb.toString());

        }

        return out;
    }

    private static long test(Pattern re, List<String> strs) {
        int matches = 0;

        // 500 rounds warm-up
        for (int i = 0; i < 500; i++) {
            for (String s : strs)
              if (re.matcher(s).matches());
        }

        long accumulated = 0;

        long before = System.currentTimeMillis();
        for (int i = 0; i < 1000; i++) {
            matches = 0;
            for (String s : strs)
              if (re.matcher(s).matches())
                matches++;
        }

        accumulated += System.currentTimeMillis() - before;

        System.out.println("Found " + matches + " matches");

        return accumulated;
    }
}

Comment/uncomment different generate lines to test, and mess around with the numbers. I warm up the VM before each test by executing the regex 500 times before timing the accumulated time to run the regex 1000 times.

I don't have any concrete number to post, since I find my own result rather unstable. However, from my testing, I generally find the first regex faster than the second regex.

By generating 500 strings, each string can be up to 40000 characters long, I find that the difference between the 2 regexes is more prominent (around 1 to 2 seconds) when the input causes them to run in less than 10 seconds. When the input causes them to run longer (40+ seconds), both regex have more or less the same run time, with at most a few hundred of milliseconds difference.

like image 64
nhahtdh Avatar answered Jan 11 '23 15:01

nhahtdh


Why is performance of ^(?:x+y){5}$ slower than ^x+yx+yx+yx+yx+y$ ?

Because here comes the concept of Backtracking.

Regex 1:

^[^@]+@[^@]+@[^@]+@[^@]+@[^@]+@$

Here there is no backtracking because each individual pattern matches a particular part of a string.

enter image description here

Regex 2:

^(?:[^@]+@){5}$

enter image description here

like image 39
Avinash Raj Avatar answered Jan 11 '23 14:01

Avinash Raj