Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

why c++11 regex (libc++ implementation) is so slow?

Tags:

c++

c

regex

I compared with Linux C regex library,

#include <iostream>
#include <chrono>
#include <regex.h>

int main()
{
    const int count = 100000;

    regex_t exp;
    int rv = regcomp(&exp, R"_(([a-zA-Z][a-zA-Z0-9]*)://([^ /]+)(/[^ ]*)?)_", REG_EXTENDED);
    if (rv != 0) {
            std::cout << "regcomp failed with " << rv << std::endl;
    }

    auto start = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < count; i++)
    {
            regmatch_t match;
            const char *sz = "http://www.abc.com";

            if (regexec(&exp, sz, 1, &match, 0) == 0) {
    //              std::cout << sz << " matches characters " << match.rm_so << " - " << match.rm_eo << std::endl;
            } else {
    //              std::cout << sz << " does not match" << std::endl;
            }
    }
    auto end = std::chrono::high_resolution_clock::now();
    auto elapsed = std::chrono::duration_cast<std::chrono::microseconds>(end - start);

    std::cout << elapsed.count() << std::endl;

    return 0;
}

The result is roughly 60-70 milliseconds on my testing machine.

Then I used libc++'s library,

#include <iostream>
#include <chrono>
#include <regex>


int main()
{
        const int count = 100000;

        std::regex rgx(R"_(([a-zA-Z][a-zA-Z0-9]*)://([^ /]+)(/[^ ]*)?)_", std::regex_constants::extended);
        auto start = std::chrono::high_resolution_clock::now();
        for (int i = 0; i < count; i++)
        {
                std::cmatch match;
                const char sz[] = "http://www.abc.com";

                if (regex_search(sz, match, rgx)) {
                } else {
                }
        }
        auto end = std::chrono::high_resolution_clock::now();
        auto elapsed = std::chrono::duration_cast<std::chrono::microseconds>(end - start);

        std::cout << "regex_search: " << elapsed.count() << std::endl;


        start = std::chrono::high_resolution_clock::now();
        for (int i = 0; i < count; i++)
        {
                const char sz[] = "http://www.abc.com";

                if (regex_match(sz, rgx)) {
                } else {
                }
        }
        end = std::chrono::high_resolution_clock::now();
        elapsed = std::chrono::duration_cast<std::chrono::microseconds>(end - start);

        std::cout << "regex_match: " << elapsed.count() << std::endl;

        return 0;
}

The result is roughly 2 seconds for both regex_search & regex_match. This is about 30 times slower than C's regex.h library.

Is there anything wrong with my comparison? Is C++'s regex library not suitable for high performance case?

I can understand it's slow because there's no optimization in c++'s regex library yet, but 30 times slower is just too much.

Thanks.


Hi all,

Thanks for answering.

Sorry for my mistake I was using [] for C too but later I changed, and forgot to change C++ code.

I made two changes,

  1. I moved const char sz[] out of the loop for both C & C++.
  2. I compiled it with -O2 (I wasn't using any optimization before), C library's implementation is still around 60 milliseconds, but libc++'s regex now gives a number says, 1 second for regex_search, and 150 milliseconds for regex_match.

This is still a bit slow, but not as much as the original comparison.

like image 367
user534498 Avatar asked Jan 06 '14 03:01

user534498


People also ask

Why is STD regex slow?

The current std::regex design and implementation are slow, mostly because the RE pattern is parsed and compiled at runtime. Users often don't need a runtime RE parser engine as the pattern is known during compilation in many common use cases.

What is boost regex?

Boost. Regex allows you to use regular expressions in C++. As the library is part of the standard library since C++11, you don't depend on Boost. Regex if your development environment supports C++11.


1 Answers

If you take a look at http://llvm.org/svn/llvm-project/libcxx/trunk/include/regex you'll see this implementation of regex_match is layered atop regex_search, and all overloads extract sub-expression match positions even if only into local temporaries that are thrown away. regex_search uses a vector of __state objects that have .resize() called on them so are presumably vectors too - all heap allocations and unnecessary when the subexpression matches aren't wanted, but would need to be tracked to support \1 etc in perl-style extensions to regular expressions: the old regcomp/regexec C functions didn't provide those extended features never have to do this extra work. Of course it would be nice if the clang implementation checked the regular expression's need for tracking matches during compilation and called leaner, faster functions to match when possible, but I guess they're just starting with support for the general case.

like image 190
Tony Delroy Avatar answered Oct 06 '22 10:10

Tony Delroy