
Would compiling a regex into native assembly be faster than PCRE or other Regex engines? [closed]

I was thinking about a possible improvement: I currently do a lot of text processing on log files.

I don't mean to say PCRE is slow/fast or any other implementation for that matter.

The language I'm writing in is primarily Perl. I know it has a powerful regex engine and I know it's more expressive than PCRE.

I have this idea of writing a small regex engine in C++ that would compile a regex down to raw NASM assembly.

I'm aware that PCRE is quite complex, and my assumption is that I could skip a lot of the unnecessary processing PCRE does. And I could certainly make this faster than Perl, since Perl's engine operates on VM-like opcodes and does all sorts of things that could be considered overhead.

I already started an implementation some time ago. I'm not posting it here because I don't have a problem with it as such; I could carry it through to the end and obtain a regex engine capable of doing captures and of interpreting +, *, ^, $, and character classes (although I haven't yet done the part that converts the regex to assembly language).

Would this be a good idea or a bad idea? What could go wrong in terms of reaching good performance with this?

tl;dr => could a C++ mini-regex engine that would produce native assembly be faster than established regex implementations?

asked Mar 25 '13 by wsdookadr


4 Answers

I think the answer to this one is pretty obvious. Ideally, yes, it could be. But even if you're very clever at this kind of thing it would take a very large development effort to get to a point where, for the most part, you're only slightly better than available libraries — and even longer to work all of the bugs out. So there's not much point in it.

answered Nov 13 '22 by hobbs


Perl's regexp engine is fast, but not blazingly fast. See Russ Cox's analysis of Perl vs. Thompson-style regexp engines for an eye-popping demonstration of the difference in performance, and for theory on how to do it right.

If you don't want to implement Thompson/Cox's scheme yourself, consider Ragel, a very fast regular-expression processor. Ragel does the hard work of building efficient automata and then generates code in your target language, letting the compiler do the hard work of converting it to machine code.

It is likely close to the engine you are contemplating building.
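To illustrate the approach (this is a hand-written sketch, not Ragel's actual output): a tool like Ragel turns a pattern such as the hypothetical `[0-9]+\.[0-9]+` into a small deterministic automaton expressed as plain branching code, which the C compiler then optimizes down to machine code.

```cpp
#include <string>

// Illustrative only -- not Ragel's actual output. The regex
// [0-9]+\.[0-9]+ compiled by hand into a small DFA, the way a
// regex-to-code generator would before handing it to a C++ compiler.
// States: 0 = start, 1 = integer digits, 2 = just saw '.',
// 3 = fraction digits (accepting). Any unexpected input rejects.
bool match_decimal(const std::string& s) {
    int state = 0;
    for (char c : s) {
        const bool d = (c >= '0' && c <= '9');
        switch (state) {
            case 0: if (d) state = 1; else return false; break;
            case 1: if (d) state = 1;
                    else if (c == '.') state = 2;
                    else return false;
                    break;
            case 2: if (d) state = 3; else return false; break;
            case 3: if (d) state = 3; else return false; break;
        }
    }
    return state == 3;  // accept only in the fraction-digits state
}
```

The point is that the "engine" disappears entirely: there is no opcode dispatch loop at runtime, just a switch over states that the optimizer can turn into jump tables.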

If these aren't good enough, you should track down recent work by IBM on extremely fast stream matchers. I think papers by Davide Pasetto are probably relevant.

answered Nov 13 '22 by Ira Baxter


Just do a benchmark. Take a regexp that is quite simple yet something you would commonly use, then make a hard-coded version of it. Compare them on a realistic amount of data, where the operation takes long enough for a speedup to matter.
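A minimal sketch of such a benchmark, assuming the illustrative pattern `[a-z]+[0-9]+` (whole-string match) and using `std::regex` as the stand-in for a general-purpose engine; the function names are hypothetical:

```cpp
#include <chrono>
#include <cstdio>
#include <regex>
#include <string>
#include <vector>

// Hard-coded equivalent of the regex [a-z]+[0-9]+ matched against the
// whole string: one or more lowercase letters, then one or more digits.
bool match_hardcoded(const std::string& s) {
    std::size_t i = 0;
    while (i < s.size() && s[i] >= 'a' && s[i] <= 'z') ++i;
    if (i == 0) return false;          // need at least one letter
    std::size_t j = i;
    while (j < s.size() && s[j] >= '0' && s[j] <= '9') ++j;
    return j > i && j == s.size();     // at least one digit, nothing after
}

// The same pattern through the general-purpose engine.
bool match_stdregex(const std::string& s) {
    static const std::regex re("[a-z]+[0-9]+");
    return std::regex_match(s, re);
}

// Time one matcher over the same inputs; returns elapsed microseconds.
long long bench(bool (*f)(const std::string&),
                const std::vector<std::string>& inputs) {
    const auto t0 = std::chrono::steady_clock::now();
    int hits = 0;
    for (const auto& s : inputs) hits += f(s);
    const auto t1 = std::chrono::steady_clock::now();
    std::printf("hits=%d\n", hits);  // keep the loop from being optimized away
    return std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0)
        .count();
}
```

Feed both matchers the same realistic corpus (e.g. lines from an actual log file) and compare the two `bench` results; a microbenchmark over a handful of strings will mostly measure noise.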

I think any good regexp engine's inner loop is already heavily tuned, using state-of-the-art text-matching algorithms and implementing them efficiently. So good luck getting faster with your own code when the expression is anything non-trivial (character classes, for example).

Obviously there is overhead; by hard-coding things your way you could achieve a nice linear speedup, all else being equal. But that is secondary to having good algorithms, and if you are not familiar with the concept of algorithmic complexity, read up on that first.


But there is one critical thing, which is almost a showstopper for real use: a regexp engine must be correct. It must find exactly what it is supposed to find, with zero false positives. Bad things, such as data corruption, can happen if you get a false match or miss a real one. How do you achieve enough confidence in your engine to dare to use it?

Create one for fun, sure. For real production use... ummm...


If you are at all serious, you need automated tests from the start. For something like a regexp engine, you also want code-coverage analysis for your tests. Since the input and output are fully deterministic, with no user interaction involved, you should ultimately aim for 100% coverage of the relevant parts of the code, both for the engine itself and for the generated code of your test regexps. Oh, and don't forget automated benchmarks either, when the goal is speed!

Of course you'll want to just get coding and not worry about this at first, and that's fine, but set up the infrastructure from the start. When you test your code, write the test as an automated one; resist the urge to test ad hoc by hand (except as a step toward writing an automated test case). Your project has a much higher chance of succeeding if you do things right.
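One cheap way to get that confidence is differential testing: run your engine and an established one over the same generated inputs and flag any disagreement. A minimal sketch, using a toy hand-coded matcher for the pattern `a*b` and `std::regex` as the reference oracle (all names here are hypothetical):

```cpp
#include <random>
#include <regex>
#include <string>

// Toy engine under test: hard-coded whole-string matcher for "a*b"
// (zero or more 'a' characters followed by exactly one 'b').
bool toy_match(const std::string& s) {
    std::size_t i = 0;
    while (i < s.size() && s[i] == 'a') ++i;
    return i < s.size() && s[i] == 'b' && i + 1 == s.size();
}

// Differential test: compare the toy engine against std::regex on
// randomly generated strings over the alphabet {a, b}.
// Returns the number of disagreements found (0 means they agree).
int differential_test(int cases, unsigned seed) {
    std::mt19937 rng(seed);
    std::uniform_int_distribution<int> len(0, 8), ch(0, 1);
    const std::regex oracle("a*b");
    int mismatches = 0;
    for (int i = 0; i < cases; ++i) {
        std::string s;
        const int n = len(rng);
        for (int k = 0; k < n; ++k) s += (ch(rng) ? 'a' : 'b');
        if (toy_match(s) != std::regex_match(s, oracle)) ++mismatches;
    }
    return mismatches;
}
```

For a real engine you would generate both random patterns and random subject strings, and keep every mismatch as a permanent regression test.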

For the generated code, LLVM is probably what you want to target, rather than the assembler of some particular CPU.

answered Nov 13 '22 by hyde


You could probably squeeze out significantly more performance from a regex compiled to machine instructions. Whether it makes sense to do so depends completely on how you amortize the cost.

If you are trying to reach the specific goal of parsing log files and have a deadline, then it might not make sense to even consider this until you've proven that existing libraries (including Google's RE2) are too slow for you. Worrying about your RE performance before you've determined it's a bottleneck is a classic case of premature optimization.

If, however, you want to do this as a challenge and learning exercise, and there's no deadline looming, then by all means give it a shot. Be prepared for a lot of work, and don't forget to write a lot of test cases because there will be bugs. Your regex engine will be the critical foundation of the overall work product, and you cannot afford for it to misbehave in production.

Good luck!

answered Nov 13 '22 by Jim Garrison