Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why pcre regex is much faster than c++11 regex

Some sample code. This is the c++11 part using cregex_iterator:

std::chrono::steady_clock::time_point begin0 = std::chrono::steady_clock::now();
regex re("<option[\\s]value[\\s]*=[\\s]*\"([^\">]*)\"[\\s]*[^>]*>", regex::icase);
int found = 0;
for (std::cregex_iterator i = std::cregex_iterator(input, input + input_length, re);
i != std::cregex_iterator();
    ++i)
{
    found++;
    if (found < 10000) continue;
    break;
}
std::chrono::steady_clock::time_point end0 = std::chrono::steady_clock::now();

This is the pcre part. The regexp is all the same.

std::chrono::steady_clock::time_point begin4 = std::chrono::steady_clock::now();
const char *pError = NULL;
int errOffset;
int options = PCRE_MULTILINE | PCRE_CASELESS;
const char* regexp = "<option[\\s]value[\\s]*=[\\s]*\"([^\">]*)\"[\\s]*[^>]*>";
pcre* pPcre = pcre_compile(regexp, options, &pError, &errOffset, 0);                
int offset = 0;
int matches = -1;
int pMatches[6];
while (offset < input_length)
{
    matches = pcre_exec(pPcre,NULL, input, input_length, offset,0, pMatches,6); 
    if (matches >= 1)
    {
        found++;
        offset = pMatches[1];
        if (found < 10000) continue;
        break;  // find match
    }
    else
        offset = input_length;
}

std::chrono::steady_clock::time_point end4 = std::chrono::steady_clock::now();

The result shows pcre is 100 times faster than c++11. I found some vector copy and resize in c++11 implementation. Are there some other reasons?

like image 263
Levi Avatar asked Jan 05 '17 09:01

Levi


1 Answers

PCRE benefits from some optimizations known as start-up optimizations which are configured to be enabled by default. These optimizations include:

  1. A subject pre-scan for unanchored patterns (if a starting point is not found engine doesn't even bother to go through matching process.)
  2. Studying pattern to ensure that minimum length of subject is not shorter than pattern itself
  3. Auto-possessification
  4. Fast failure (if a specific point is not found engine doesn't even bother to go through matching process.)

Superficial pattern analyzing:

<option             # Subject pre-scan applied (unachored pattern)
    [\\s]
    value
    [\\s]*          # Auto-possessification applied (translates to \s*+)
    =
    [\\s]*          # //
    \"([^\">]*)\"   
    [\\s]*          # //
    [^>]*
>                   # Min length (17 chars) check of subject string applied

Furthermore, if input string doesn't have a special character like >, a fast failure is supposed to be thrown. You should know that performance can depend on input string heavily as well.

Run below pattern:

(*NO_AUTO_POSSESS)(*NO_START_OPT)<option[\s]value[\s]*=[\s]*\"([^\">]*)\"[\s]*[^>]*>

over this input string (watch that period):

<option value                                                                 .

and compare result (Live demo).

like image 145
revo Avatar answered Nov 17 '22 03:11

revo