How good is Oniguruma compared to other cross-platform regexp libraries?

Question

We are trying to get rid of boost::regex and its awful performance. According to this benchmark, Oniguruma is the best overall.

We have multiple regexps (which change all the time) that we apply to strings ranging from medium (100 chars) to huge (1k chars), so it's a very heterogeneous environment.

Have any of you used it successfully? Would you recommend going for one of the more "standard" ones like PCRE or RE2 instead?

Thanks!

andrew cooke · Accepted Answer

the two kinds of implementation, finite-state automaton (FSA) and backtracking (BT), have quite different behaviours, which you can see in the right-hand column (email) of that benchmark.

oniguruma is generally fast, but has the possibility of running slowly if you're "unlucky" with a particular regexp. that's because it's a backtracking algorithm.

in contrast, while re2 is generally a little slower, it doesn't have the same risk - its time will never[*] explode in the same way (it doesn't have worst case exponential behaviour).

so it depends on details. if you're confident that your regexps will be safe, or are willing to detect and abort slow matches, oniguruma makes sense. but personally i would be inclined to pay a little more (not much more) for the security of re2.
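to make the "unlucky regexp" case and the "detect and abort" option concrete, here is a minimal sketch. it is a toy backtracking matcher, not Oniguruma's engine or API, and every name in it (`Item`, `backtrack`, `run_pathological`, the `budget` parameter) is made up for illustration. it only understands patterns built from `a?` and `a`, which is enough to build the a?ⁿaⁿ family that the article linked in this answer uses to show exponential blow-up, and it gives up once a step budget is exceeded.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// One pattern element: a literal character, optionally followed by '?'.
struct Item {
    char ch;
    bool optional;
};

enum class Result { Match, NoMatch, Aborted };

struct Outcome {
    Result result;
    std::uint64_t steps;  // number of backtrack() calls made
};

// Naive anchored backtracking: for an optional item, first try to consume
// a character, and only if that path fails try skipping the item.
Result backtrack(const std::vector<Item>& pat, std::size_t pi,
                 const std::string& text, std::size_t ti,
                 std::uint64_t budget, std::uint64_t& steps) {
    if (++steps > budget) return Result::Aborted;  // detect and abort
    if (pi == pat.size())
        return ti == text.size() ? Result::Match : Result::NoMatch;

    const Item& item = pat[pi];
    if (ti < text.size() && text[ti] == item.ch) {
        const Result r = backtrack(pat, pi + 1, text, ti + 1, budget, steps);
        if (r != Result::NoMatch) return r;  // Match or Aborted: stop here
    }
    if (item.optional)
        return backtrack(pat, pi + 1, text, ti, budget, steps);
    return Result::NoMatch;
}

// Builds n copies of "a?" followed by n copies of "a" and matches the
// result against a string of n 'a' characters.
Outcome run_pathological(int n, std::uint64_t budget) {
    assert(n >= 0);
    std::vector<Item> pat;
    for (int i = 0; i < n; ++i) pat.push_back({'a', true});
    for (int i = 0; i < n; ++i) pat.push_back({'a', false});
    const std::string text(static_cast<std::size_t>(n), 'a');

    std::uint64_t steps = 0;
    const Result r = backtrack(pat, 0, text, 0, budget, steps);
    return {r, steps};
}
```

the only successful path is the one where every `a?` matches nothing, and this matcher tries it last, so `run_pathological(n, budget)` needs more than 2ⁿ steps to find the match when the budget is large enough. with a small budget it returns `Result::Aborted` instead of running away. a real engine would need its own mechanism for that (a step counter like this one, or a timeout); check the library's documentation for what it actually offers.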

for more on this see http://swtch.com/~rsc/regexp/regexp1.html (by the re2 author).

[*] well, maybe never is too strong, and there's a catch: re2 gets this guarantee partly by not supporting the features that need backtracking (backreferences and lookaround), so a regexp that relies on those can't be used with it at all. but it's still safer on the regexps it does accept.

pyrho · Answer

I've done a benchmark with the following libraries:

Boost
re2
Oniguruma

The benchmark consisted of running a series of tests that make heavy use of regexps. The regexps are very heterogeneous (with groups, without groups, long ones (484 characters), short ones, pipes, \?, *, ., etc.) and are applied to texts ranging from a few characters to around 8k characters.

Each time a regexp match was computed, I recorded the regexp and added the elapsed time to a millisecond counter for that regexp (each regexp is called multiple times).
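The measurement described above can be sketched roughly as follows. This is not the code used for the benchmark: `RegexTimer` and its methods are hypothetical names, `std::regex` stands in for whichever library is being measured, and leaving compilation out of the timed region is an assumption on my part.

```cpp
#include <chrono>
#include <cstddef>
#include <map>
#include <ratio>
#include <regex>
#include <string>

// Accumulates, per regexp, the time spent matching and the number of calls.
class RegexTimer {
    using Clock = std::chrono::steady_clock;

    struct Stats {
        Clock::duration total = Clock::duration::zero();
        std::size_t calls = 0;
    };

    std::map<std::string, Stats> stats_;  // keyed by the regexp source text

public:
    // Runs one search and adds its elapsed time to the regexp's counter.
    bool timed_search(const std::string& pattern, const std::string& text) {
        // Compiled outside the timed region (assumption). A real benchmark
        // would cache compiled regexps instead of rebuilding them each call.
        const std::regex re(pattern);

        const auto start = Clock::now();
        const bool found = std::regex_search(text, re);
        const auto elapsed = Clock::now() - start;

        Stats& s = stats_[pattern];
        s.total += elapsed;
        ++s.calls;
        return found;
    }

    // Total matching time for this regexp, in milliseconds.
    double total_ms(const std::string& pattern) const {
        const auto it = stats_.find(pattern);
        if (it == stats_.end()) return 0.0;
        return std::chrono::duration<double, std::milli>(it->second.total).count();
    }

    // How many times this regexp was matched.
    std::size_t calls(const std::string& pattern) const {
        const auto it = stats_.find(pattern);
        return it == stats_.end() ? 0 : it->second.calls;
    }
};
```

Typical use is to route every match in the test suite through `timed_search(pattern, text)`, then read `total_ms(pattern)` per regexp once the run is over. `steady_clock` is used because it never jumps backwards, which matters when many short intervals are summed.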

Here is the total time spent on all regexps for each library:

Boost: 98840 ms
re2: 51197 ms
Oniguruma: 16095 ms
re2 (NO CAPTURE*, see below): 16162 ms

*We (almost) always want to capture the content of groups in our regexps, and re2 performs horribly when it captures a group (see here). You don't see that much in the totals above because when the group cannot be captured, it performs well. For example, on this regexp (executed many times):

^((?:https?://)?(?:[a-z0-9\-]{1,63}\.)+(?:[a-z0-9\-]{1,63}))(?:[^\?]*).*$

Here are the results for each library:

Boost: 140 ms
re2: 5663 ms
Oniguruma: 53 ms
re2 (NO CAPTURE): 37 ms

See the drop for re2 from 5663 ms to 37 ms.

tl;dr

So my conclusion is that for our use, Oniguruma is clearly superior.

But if you don't need to capture groups, re2 is a better choice, since I found its API easier to use.
