
How good is Oniguruma compared to other cross-platform regexp libraries?

We are trying to get rid of boost::regex and its awful performance. According to this benchmark, Oniguruma is the best overall.

We have multiple, constantly changing regexps that we apply to strings ranging from medium (100 chars) to huge (1k chars), so it's a very heterogeneous environment.

Have any of you used it successfully? Or would you recommend going for one of the more "standard" ones like PCRE or RE2?

Thanks!

pyrho asked Aug 16 '12 23:08


2 Answers

the two kinds of implementation, finite-state automata (FSA) and backtracking (BT), have quite different behaviours, which you can see in the right-hand column (email) of that benchmark.

oniguruma is generally fast, but has the possibility of running slowly if you're "unlucky" with a particular regexp. that's because it's a backtracking algorithm.

in contrast, while re2 is generally a little slower, it doesn't have the same risk - its time will never[*] explode in the same way (it doesn't have worst case exponential behaviour).

so it depends on details. if you're confident that your regexps will be safe, or are willing to detect and abort slow matches, oniguruma makes sense. but personally i would be inclined to pay a little more (not much more) for the security of re2.

for more on this see http://swtch.com/~rsc/regexp/regexp1.html (by the re2 author).
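to make the difference concrete, here is a minimal sketch of the classic worst case from that article: n copies of `a?` followed by n copies of `a`, matched against a text of n `a`s. the two matchers are my own toy code, not taken from oniguruma or re2, and they only understand literal characters with an optional `?`. the names `Tok`, `bt_match`, `set_match` and `worst_case` are made up for this illustration. both functions count their own steps so the growth can be compared.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// One pattern element: a literal character, optionally followed by '?'.
struct Tok { char c; bool optional; };

// Backtracking matcher: first try to consume a character with the current
// token; if the rest then fails and the token is optional, retry by skipping it.
bool bt_match(const std::vector<Tok>& p, std::size_t i,
              const std::string& s, std::size_t j, long long& steps) {
    ++steps;
    if (i == p.size()) return j == s.size();
    if (j < s.size() && s[j] == p[i].c && bt_match(p, i + 1, s, j + 1, steps))
        return true;
    if (p[i].optional) return bt_match(p, i + 1, s, j, steps);
    return false;
}

// State-set matcher: after each input character, keep the set of all pattern
// positions that could have been reached, instead of exploring them one by one.
bool set_match(const std::vector<Tok>& p, const std::string& s, long long& steps) {
    std::vector<char> cur(p.size() + 1, 0), next;
    // Mark position i, plus every position reachable by skipping optional tokens.
    auto add = [&](std::vector<char>& set, std::size_t i) {
        while (true) {
            ++steps;
            if (set[i]) return;
            set[i] = 1;
            if (i < p.size() && p[i].optional) ++i; else return;
        }
    };
    add(cur, 0);
    for (char ch : s) {
        next.assign(p.size() + 1, 0);
        for (std::size_t i = 0; i < p.size(); ++i)
            if (cur[i] && p[i].c == ch) add(next, i + 1);
        cur.swap(next);
    }
    return cur[p.size()] != 0;
}

// Builds the pattern a?a?...a? aa...a with n optional and n mandatory tokens.
std::vector<Tok> worst_case(int n) {
    std::vector<Tok> p;
    for (int k = 0; k < n; ++k) p.push_back({'a', true});
    for (int k = 0; k < n; ++k) p.push_back({'a', false});
    return p;
}
```

with `worst_case(18)` and a text of 18 `a`s, both functions report a match, but `bt_match` needs more than 2^18 calls (it roughly doubles with every extra `a?`), while `set_match` stays below 2,000 steps. real engines are far more sophisticated than this, but the shape of the two curves is the point.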

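[*] to be precise: re2 does not fall back on a BT approach. instead it leaves out the features that would need one (backreferences, i.e. matching previous matches, and lookaround), so a regexp that uses them is rejected rather than run slowly. for the regexps it does accept, matching time is linear in the length of the input.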

andrew cooke answered Nov 04 '22 17:11


I've done a benchmark with the following libraries:

  • Boost
  • re2
  • Oniguruma

The benchmark consisted of executing a series of tests that make heavy use of very heterogeneous regexps (grouping, non-grouping, long ones (484 characters), short ones, pipes, \?, *, ., etc.), applied to texts ranging from a few characters to around 8k characters.

Each time a regexp match was computed, I stored the regexp and added the time spent on that match to a per-regexp millisecond counter (each regexp is called multiple times).
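A minimal sketch of that kind of per-regexp accounting is shown below. It is not the harness I actually used: it relies on std::regex as a stand-in so that it is self-contained, the names `RegexTimer`, `PatternStats`, `timed_search`, `total_ms` and `calls` are made up for the illustration, and keeping compilation outside the timed region is an assumption of the sketch.

```cpp
#include <chrono>
#include <cstddef>
#include <map>
#include <regex>
#include <string>

// Per-pattern accumulator: total time spent matching and number of calls.
struct PatternStats {
    std::chrono::steady_clock::duration total{};
    std::size_t calls = 0;
};

class RegexTimer {
public:
    // Runs one search and adds its duration to the counter for `pattern`.
    // Compiled regexps are cached, so compilation is not part of the timing.
    bool timed_search(const std::string& pattern, const std::string& text) {
        auto it = compiled_.find(pattern);
        if (it == compiled_.end())
            it = compiled_.emplace(pattern, std::regex(pattern)).first;

        const auto start = std::chrono::steady_clock::now();
        const bool found = std::regex_search(text, it->second);
        const auto stop = std::chrono::steady_clock::now();

        PatternStats& s = stats_[pattern];
        s.total += stop - start;
        ++s.calls;
        return found;
    }

    // Accumulated time for one pattern, converted to milliseconds only here.
    double total_ms(const std::string& pattern) const {
        const auto it = stats_.find(pattern);
        if (it == stats_.end()) return 0.0;
        return std::chrono::duration<double, std::milli>(it->second.total).count();
    }

    // Number of timed calls recorded for one pattern.
    std::size_t calls(const std::string& pattern) const {
        const auto it = stats_.find(pattern);
        return it == stats_.end() ? 0 : it->second.calls;
    }

private:
    std::map<std::string, std::regex> compiled_;
    std::map<std::string, PatternStats> stats_;
};
```

The sketch accumulates the clock's native duration and converts to milliseconds only when reporting, so that matches taking well under a millisecond are not rounded away call by call.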

Here is the total time spent on all regexps for each library:

  • Boost: 98840 ms
  • re2: 51197 ms
  • Oniguruma: 16095 ms
  • re2 (NO CAPTURE*, see below): 16162 ms

*We (almost) always want to capture the content of groups in a regexp, and re2 performs horribly when it captures a group (see here). You don't see that much in the above result because when the group cannot be captured, it performs well. For example, on this regexp (executed many times):

^((?:https?://)?(?:[a-z0-9\-]{1,63}\.)+(?:[a-z0-9\-]{1,63}))(?:[^\?]*).*$

Here are the results for each library:

  • Boost: 140 ms
  • re2: 5663 ms
  • Oniguruma: 53 ms
  • re2 (NO CAPTURE): 37 ms

See the drop for re2 from 5663 ms to 37 ms.
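Whether groups are captured is decided at the call site. The sketch below shows the two call shapes on the regexp above, using std::regex purely as a stand-in so that the example is self-contained; it is not one of the benchmarked libraries and its timings say nothing about re2. `kUrlPattern`, `matches_url` and `capture_host` are hypothetical names. In re2 the corresponding choice is, as far as I know, whether you pass output arguments to `RE2::FullMatch` / `RE2::PartialMatch` or not.

```cpp
#include <regex>
#include <string>

// The regexp from the benchmark, unchanged.
const char* const kUrlPattern =
    R"(^((?:https?://)?(?:[a-z0-9\-]{1,63}\.)+(?:[a-z0-9\-]{1,63}))(?:[^\?]*).*$)";

// Yes/no answer only: no match object is passed, and `nosubs` tells the
// library that the parenthesised groups do not need to be recorded.
bool matches_url(const std::string& text) {
    static const std::regex re(kUrlPattern,
                               std::regex::ECMAScript | std::regex::nosubs);
    return std::regex_match(text, re);
}

// Same pattern, but asking for capture group 1 (optional scheme plus host).
// Returns an empty string when the text does not match.
std::string capture_host(const std::string& text) {
    static const std::regex re(kUrlPattern);
    std::smatch m;
    if (!std::regex_match(text, m, re)) return std::string();
    return m[1].str();
}
```

For the input `http://www.example.com/path?x=1`, `matches_url` returns true and `capture_host` returns `http://www.example.com`. If only the yes/no answer is needed on a hot path, the first shape is the one to measure.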

tl;dr

So my conclusion is that for our use, Oniguruma is clearly superior.

But if you don't need to capture groups, re2 is a better choice, since I found that its API is easier to use.

pyrho answered Nov 04 '22 19:11