Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Optimization techniques used by std::regex_constants::optimize

I am working with std::regex, and whilst reading about the various constants defined in std::regex_constants, I came across std::optimize, reading about it, it sounds like it is useful in my application (I only need one instance of the regex, initialized at the beginning, but it is used multiple times throughout the loading process).

According to the working paper n3126 (pg. 1077), std::regex_constants::optimize:

Specifies that the regular expression engine should pay more attention to the speed with which regular expressions are matched, and less to the speed with which regular expression objects are constructed. Otherwise it has no detectable effect on the program output.

I was curious as to what type of optimization would be performed, but there doesn't seem to be much literature about it (indeed, it seems to be undefined), and one of the only things I found was at cppreference.com, which stated that std::regex_constants::optimize:

Instructs the regular expression engine to make matching faster, with the potential cost of making construction slower. For example, this might mean converting a non-deterministic FSA to a deterministic FSA.

However, I have no formal background in computer science, and whilst I'm aware of the basics of what an FSA is, and understand the basic difference between a deterministic FSA (each state only has one possible next state), and a non-deterministic FSA (with multiple potential next states); I do not understand how this improves matching time. Also, I would be interested to know if there are any other optimizations in various C++ Standard Library implementations.

like image 237
Thomas Russell Avatar asked Jul 21 '12 13:07

Thomas Russell


People also ask

Why is C++ regex so slow?

The Regular Expression is obviously all about string, it needs to play with string everywhere. std::regex plays a lot with std::string , that's why the performance of std::regex is not good. std::regex can't do much thing at compile time because it uses std::string , this could cause the performance issue too.

How is regex so fast?

Good regular expressions are often longer than bad regular expressions because they make use of specific characters/character classes and have more structure. This causes good regular expressions to run faster as they predict their input more accurately.


2 Answers

There's some useful information on the topic of regex engines and performance trade offs (far more than can fit in a stackoverflow answer) in Mastering Regular Expressions by Jeffrey Friedl.

It's worth noting that Boost.Regex, which was the source for N3126, documents optimize as "This currently has no effect for Boost.Regex."

P.S.

indeed, it seems to be implementation-defined

No, it's unspecified. Implementation-defined means an implementation is required to define the choice of behaviour. Implementations are not required to document how their regex engines are implemented or what (if any) difference the optimize flag makes.

P.S. 2

in various STL implementations

std::regex is not part of the STL, the C++ Standard Library is not the same thing as the STL.

like image 184
Jonathan Wakely Avatar answered Oct 19 '22 05:10

Jonathan Wakely


See http://swtch.com/~rsc/regexp/regexp1.html for a nice explanation on how NFA based regex implementations can avoid the exponential backtracking that occurs in DFA matchers in certain circumstances.

like image 24
JohannesD Avatar answered Oct 19 '22 05:10

JohannesD