Could it be that python's C regex implementation is 6 times faster or am I missing something ?
Python version:
import re
r=re.compile(r'(HELLO).+?(\d+)', re.I)
s=r"prefixdfadfadf adf adf adf adf he asdf dHello Regex 123"
%timeit r.search(s)
1000000 loops, best of 3: 1.3 µs per loop (769,000 per sec)
C++11 version:
#include<regex>
int main(int argc, char * argv[])
{
std::string s = "prefixdfadfadf adf adf adf adf he asdf dHello Regex 123";
std::regex my(R"((HELLO).+?(\d+))", regex_constants::icase);
bench_utils::run(std::chrono::seconds(10),
[&]{
std::smatch match;
bool found = std::regex_search(s, match, my);
});
return 0;
}
Results in about ~125,000 searches/second
Edit: Here is the code for bench_utils:
namespace bench_utils
{
template<typename T>
inline std::string formatNum(const T& value)
{
static std::locale loc("");
std::stringstream ss;
ss.imbue(loc);
ss << value;
return ss.str();
}
inline void run(const std::chrono::milliseconds &duration,
const std::function<void() >& fn)
{
using namespace std::chrono;
typedef steady_clock the_clock;
size_t counter = 0;
seconds printInterval(1);
auto startTime = the_clock::now();
auto lastPrintTime = startTime;
while (true)
{
fn();
counter++;
auto now = the_clock::now();
if (now - startTime >= duration)
break;
auto p = now - lastPrintTime;
if (now - lastPrintTime >= printInterval)
{
std::cout << formatNum<size_t>(counter) << " ops per second" << std::endl;
counter = 0;
lastPrintTime = the_clock::now();
}
}
}
}
Conclusion. grep is so much faster than the regex engine of Python that even reading the whole file several times does not matter.
I would say C++11 regex is much slower than perl and possibly than python. To measure properly the performance it is better to do the tests using some not trivial expression or else you are measuring everything but the regex itself. in real scenarios I get ratios about 100 to 200 times slower.
Two-regex is now 15.9 times slower than string replacement, and regex/lambda 38.8 times slower.
Regex is faster for large string than an if (perhaps in a for loops) to check if anything matches your requirement.
The first thing to note is that in Python, regex (whether using the re
, or regex
module) occurs 'at the speed of c', that is the actual heavy lifting code is cold hard c, and thus at least for longer strings the performance is going to depend on the c regexp implementation.
Sometimes python is pretty clever, python has no trouble performing in the vicinity of tens of millions of operations per second and it can create millions of objects per second - this is a thousand times slower than c, but if we're talking something that takes microseconds to begin with, the python overhead may not really matter, it will only add 0.1 microseconds to each function call.
So in this case the relative slowness of Python doesn't matter. It's fast enough in absolute terms that what matters is how fast the regular expression functions do their thing.
I rewrote the c++ case to be not subject to any criticisms (I hope, feel free to point out any), in fact it doesn't even need to create a match object as search simply returns a bool (true/false):
#include <regex>
#include <iostream>
int main(int argc, char * argv[])
{
std::string s = "prefixdfadfadf adf adf adf adf he asdf dHello Regex 123";
std::regex my(R"((HELLO).+?(\d+))", std::regex_constants::icase);
int matches = 0;
for (int i = 0; i < 1000000; ++i)
matches += std::regex_search(s, my);
std::cout << matches << std::endl;
return 0;
}
I wrote a comparable python program (although python did create and return a match object) and my results were exactly the same as yours
c++ : 6.661s python: 1.039s
I think the basic conclusion here is that Python's regex implementation simply thrashes the c++ standard library one.
A while back just for fun I compared Python's regex performance with Go's regex performance. And python was at least twice as fast.
The conclusion is that python's regexp implementation is very good and you should certainly not look outside Python to get improved regexp performance. The work regular expression do is fundamentally time consuming enough that Python's overhead doesn't really matter in the slightest and Python's got a great implementation (and the new regex
module is often even faster than re
).
Using timeit to do benchmarks is wrong since it gives you best of 3 and not a statistical difference test.
It's your code, not the language.
std::function
will make the C++ code slower;std::smatch
match; in each iteration;I also wonder what optimization you are running with.
The run()
function is doing too much. Fix that. :)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With