Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Is regex too slow? Real life examples where simple non-regex alternative is better

I've seen people here made comments like "regex is too slow!", or "why would you do something so simple using regex!" (and then present a 10+ lines alternative instead), etc.

I haven't really used regex in industrial setting, so I'm curious if there are applications where regex is demonstratably just too slow, AND where a simple non-regex alternative exists that performs significantly (maybe even asymptotically!) better.

Obviously many highly-specialized string manipulations with sophisticated string algorithms will outperform regex easily, but I'm talking about cases where a simple solution exists and significantly outperforms regex.

What counts as simple is subjective, of course, but I think a reasonable standard is that if it uses only String, StringBuilder, etc, then it's probably simple.


Note: I would very much appreciate answers that demonstrate the following:

  1. a beginner-level regex solution to a non-toy real-life problem that performs horribly
  2. the simple non-regex solution
  3. the expert-level regex rewrite that performs comparably
like image 464
polygenelubricants Avatar asked Apr 19 '10 11:04

polygenelubricants


People also ask

Is regex too slow?

The reason the regex is so slow is that the "*" quantifier is greedy by default, and so the first ". *" tries to match the whole string, and after that begins to backtrack character by character. The runtime is exponential in the count of numbers on a line. Show activity on this post.

Does regex affect performance?

Being more specific with your regular expressions, even if they become much longer, can make a world of difference in performance. The fewer characters you scan to determine the match, the faster your regexes will be.

Is there anything faster than regex?

String operations will always be faster than regular expression operations. Unless, of course, you write the string operations in an inefficient way. Regular expressions have to be parsed, and code generated to perform the operation using string operations.


2 Answers

I remember a textbook example of a regex gone bad. Be aware that none of the following approaches is recommended for production use! Use a proper CSV parser instead.

The mistake made in this example is quite common: Using a dot where a narrower character class is better suited.

In a CSV file containing on each line exactly 12 integers separated by commas, find the lines that have a 13 in the 6th position (no matter where else a 13 may be).

1, 2, 3, 4, 5, 6, 7, 8 ,9 ,10,11,12 // don't match 42,12,13,12,32,13,14,43,56,31,78,10 // match 42,12,13,12,32,14,13,43,56,31,78,10 // don't match 

We use a regex containing exactly 11 commas:

".*,.*,.*,.*,.*,13,.*,.*,.*,.*,.*,.*" 

This way, each ".*" is confined to a single number. This regex solves the task, but has very bad performance. (Roughly 600 microseconds per string on my computer, with little difference between matched and unmatched strings.)

A simple non-regex solution would be to split() each line and compare the 6th element. (Much faster: 9 microseconds per string.)

The reason the regex is so slow is that the "*" quantifier is greedy by default, and so the first ".*" tries to match the whole string, and after that begins to backtrack character by character. The runtime is exponential in the count of numbers on a line.

So we replace the greedy quantifier with the reluctant one:

".*?,.*?,.*?,.*?,.*?,13,.*?,.*?,.*?,.*?,.*?,.*?" 

This performs way better for a matched string (by a factor of 100), but has almost unchanged performance for a non-matched string.

A performant regex replaces the dot by the character class "[^,]":

"[^,]*,[^,]*,[^,]*,[^,]*,[^,]*,13,[^,]*,[^,]*,[^,]*,[^,]*,[^,]*,[^,]*" 

(This needs 3.7 microseconds per string for the matched string and 2.4 for the unmatched strings on my computer.)

like image 97
Christian Semrau Avatar answered Sep 24 '22 18:09

Christian Semrau


I experimented a bit with the performance of various constructs, and unfortunately I discovered that Java regex doesn't perform what I consider very doable optimizations.

Java regex takes O(N) to match "(?s)^.*+$"

This is very disappointing. It's understandable for ".*" to take O(N), but with the optimization "hints" in the form of anchors (^ and $) and single-line mode Pattern.DOTALL/(?s), even making the repetition possessive (i.e. no backtracking), the regex engine still could not see that this will match every string, and still have to match in O(N).

This pattern isn't very useful, of course, but consider the next problem.

Java regex takes O(N) to match "(?s)^A.*Z$"

Again, I was hoping that the regex engine can see that thanks to the anchors and single-line mode, this is essentially the same as the O(1) non-regex:

 s.startsWith("A") && s.endsWith("Z") 

Unfortunately, no, this is still O(N). Very disappointing. Still, not very convincing because a nice and simple non-regex alternative exists.

Java regex takes O(N) to match "(?s)^.*[aeiou]{3}$"

This pattern matches strings that ends with 3 lowercase vowels. There is no nice and simple non-regex alternative, but you can still write something non-regex that matches this in O(1), since you only need to check the last 3 characters (for simplicity, we can assume that the string length is at least 3).

I also tried "(?s)^.*$(?<=[aeiou]{3})", in an attempt to tell the regex engine to just ignore everything else, and just check the last 3 characters, but of course this is still O(N) (which follows from the first section above).

In this particular scenario, however, regex can be made useful by combining it with substring. That is, instead of seeing if the whole string matches the pattern, you can manually restrict the pattern to attempt to match only the last 3 characters substring. In general, if you know before hand that the pattern has a finite length maximum match, you can substring the necessary amount of characters from the end of a very long string and regex on just that part.


Test harness

static void testAnchors() {     String pattern = "(?s)^.*[aeiou]{3}$";     for (int N = 1; N < 20; N++) {         String needle = stringLength(1 << N) + "ooo";         System.out.println(N);         boolean b = true;         for (int REPS = 10000; REPS --> 0; ) {             b &=                needle               //.substring(needle.length() - 3) // try with this               .matches(pattern);         }         System.out.println(b);     } } 

The string length in this test grows exponentially. If you run this test, you will find that it starts to really slow down after 10 (i.e. string length 1024). If you uncomment the substring line, however, the entire test will complete in no time (which also confirms that the problem is not because I didn't use Pattern.compile, which would yield a constant improvement at best, but rather because the patttern takes O(N) to match, which is problematic when the asymptotic growth of N is exponential).


Conclusion

It seems that Java regex does little to no optimization based on the pattern. Suffix matching in particular is especially costly, because the regex still needs to go through the entire length of the string.

Thankfully, doing the regex on the chopped suffix using substring (if you know the maximum length of the match) can still allow you to use regex for suffix matching in time independent of the length of the input string.

//update: actually I just realized that this applies to prefix matching too. Java regex matches a O(1) length prefix pattern in O(N). That is, "(?s)^[aeiou]{3}.*$" checks if a string starts with 3 lowercase letters in O(N) when it should be optimizable to O(1).

I thought prefix matching would be more regex-friendly, but I don't think it's possible to come up with a O(1)-runtime pattern to match the above (unless someone can prove me wrong).

Obviously you can do the s.substring(0, 3).matches("(?s)^[aeiou]{3}.*$") "trick", but the pattern itself is still O(N); you've just manually reduced N to a constant by using substring.

So for any kind of finite-length prefix/suffix matching of a really long string, you should preprocess using substring before using regex; otherwise it's O(N) where O(1) suffices.

like image 37
polygenelubricants Avatar answered Sep 21 '22 18:09

polygenelubricants