Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Regex speed: Python x6 times faster than C++11 under VS2013?

Could it be that python's C regex implementation is 6 times faster or am I missing something ?

Python version:

import re
r=re.compile(r'(HELLO).+?(\d+)', re.I)
s=r"prefixdfadfadf adf adf adf adf he asdf dHello Regex 123"

%timeit r.search(s)

1000000 loops, best of 3: 1.3 µs per loop (769,000 per sec)

C++11 version:

#include<regex>
int main(int argc, char * argv[])
{
    std::string s = "prefixdfadfadf adf adf adf adf he asdf dHello Regex 123";
    std::regex my(R"((HELLO).+?(\d+))", regex_constants::icase);

    bench_utils::run(std::chrono::seconds(10),
        [&]{
        std::smatch match;
        bool found = std::regex_search(s, match, my);
    });       
    return 0;
}

Results in about ~125,000 searches/second

Edit: Here is the code for bench_utils:

namespace bench_utils
{
    template<typename T>    
    inline std::string formatNum(const T& value)
    {
            static std::locale loc("");
            std::stringstream ss;
            ss.imbue(loc);
            ss << value;
            return ss.str();
        }

    inline void run(const std::chrono::milliseconds &duration,
        const std::function<void() >& fn)
    {
        using namespace std::chrono;
        typedef steady_clock the_clock;
        size_t counter = 0;
        seconds printInterval(1);
        auto startTime = the_clock::now();
        auto lastPrintTime = startTime;
        while (true)
        {
            fn();
            counter++;
            auto now = the_clock::now();
            if (now - startTime >= duration)
                break;
            auto p = now - lastPrintTime;
            if (now - lastPrintTime >= printInterval)
            {
                std::cout << formatNum<size_t>(counter) << " ops per second" << std::endl;
                counter = 0;
                lastPrintTime = the_clock::now();
            }
        }
    }

}
like image 679
GabiMe Avatar asked Feb 26 '14 07:02

GabiMe


People also ask

Is regex faster Python?

Conclusion. grep is so much faster than the regex engine of Python that even reading the whole file several times does not matter.

Is regex slow C++?

I would say C++11 regex is much slower than perl and possibly than python. To measure properly the performance it is better to do the tests using some not trivial expression or else you are measuring everything but the regex itself. in real scenarios I get ratios about 100 to 200 times slower.

Is regex faster than string replace Python?

Two-regex is now 15.9 times slower than string replacement, and regex/lambda 38.8 times slower.

Is regular expression faster than loop?

Regex is faster for large string than an if (perhaps in a for loops) to check if anything matches your requirement.


2 Answers

The first thing to note is that in Python, regex (whether using the re, or regex module) occurs 'at the speed of c', that is the actual heavy lifting code is cold hard c, and thus at least for longer strings the performance is going to depend on the c regexp implementation.

Sometimes python is pretty clever, python has no trouble performing in the vicinity of tens of millions of operations per second and it can create millions of objects per second - this is a thousand times slower than c, but if we're talking something that takes microseconds to begin with, the python overhead may not really matter, it will only add 0.1 microseconds to each function call.

So in this case the relative slowness of Python doesn't matter. It's fast enough in absolute terms that what matters is how fast the regular expression functions do their thing.

I rewrote the c++ case to be not subject to any criticisms (I hope, feel free to point out any), in fact it doesn't even need to create a match object as search simply returns a bool (true/false):

#include <regex>
#include <iostream>

int main(int argc, char * argv[])
{
    std::string s = "prefixdfadfadf adf adf adf adf he asdf dHello Regex 123";
    std::regex my(R"((HELLO).+?(\d+))", std::regex_constants::icase);

    int matches = 0;
    for (int i = 0; i < 1000000; ++i)
        matches += std::regex_search(s, my);


    std::cout << matches  << std::endl;
    return 0;
}

I wrote a comparable python program (although python did create and return a match object) and my results were exactly the same as yours

c++   : 6.661s
python: 1.039s

I think the basic conclusion here is that Python's regex implementation simply thrashes the c++ standard library one.

It thrashes Go too

A while back just for fun I compared Python's regex performance with Go's regex performance. And python was at least twice as fast.

The conclusion is that python's regexp implementation is very good and you should certainly not look outside Python to get improved regexp performance. The work regular expression do is fundamentally time consuming enough that Python's overhead doesn't really matter in the slightest and Python's got a great implementation (and the new regex module is often even faster than re).

like image 184
Blake Walsh Avatar answered Oct 21 '22 15:10

Blake Walsh


Using timeit to do benchmarks is wrong since it gives you best of 3 and not a statistical difference test.

It's your code, not the language.

  1. Passing the function as a std::function will make the C++ code slower;
  2. Calling clock functions in every iterations;
  3. Creating new objects, such as the std::smatch match; in each iteration;
  4. The run function;
  5. Not precompiling the regex.

I also wonder what optimization you are running with.

The run() function is doing too much. Fix that. :)

like image 33
Good Person Avatar answered Oct 21 '22 15:10

Good Person