Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How much faster are regular expressions processed in C/Java than in Python? [closed]

I am looking for benchmarks that compare regular expression speeds between python and statically typed languages like C, Java or C++. I would also like to hear about Cython performance for regular expressions.

like image 925
Jeff Mandell Avatar asked Mar 17 '23 03:03

Jeff Mandell


1 Answers

This is likely to depend more on the individual implementation than the language.

Just for example, some patterns are O(N2) with some implementations, but ~O(N) with others. Specifically, most RE implementations are based on NFAs (Non-Deterministic Finite State Automata). To make a long story short, this means they can and will backtrack under some circumstances with some patterns. This gives roughly O(N2) complexity. A Deterministic Finite State Automata (DFA) matching the same pattern never backtracks--it always has linear complexity. At the same time, the compilation phase for a DFA is typically more complex than for an NFA (and DFAs don't have all the capabilities of NFAs).

Therefore, with many simple patterns that don't involve backtracking any way, an NFA-based RE engine may easily run faster than a DFA-based engine. But, when the NFA-based RE engine is trying to match a pattern than involves backtracking, it can (and will) slow down drastically. In the latter case, the DFA-based engine may easily be many times faster.

Most RE libraries basically start from a regular expression represented as a string. When you do an RE based search/match, most compile that into a data structure for their NFA/DFA. That compilation step takes some time (not a huge amount, but can become significant, especially if you're working with a lot of different REs). A few RE engines (e.g., Boost XPressive) can compile regular expressions statically--that is, the RE is compiled at the same time as the program's source code. This can eliminate the time to compile the RE from the program's execution time, so if your code spends a significant amount of its time on compiling REs, it could gain a substantial improvement from that (but that's independent of just static typing--at least to my knowledge, you can't get the same in Java or C, or example). A few other languages (e.g., D) provide enough capabilities that you could almost certainly do the same with them, but I'm not aware of an actual implementation for them that you can plan on using right now.

like image 64
Jerry Coffin Avatar answered Apr 26 '23 12:04

Jerry Coffin