Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Recommendations for a C implementation of a regex parser

I'm thinking about implementing a regular expression parser in a C library I'm developing. Now, the question is: is there any open source code that I could use verbatim or with as few changes as possible? My expectations regarding the code are:

  • it needs to be written in C (not C++)
  • it needs to compile under gcc, mingw, M$VC
  • it mustn't depend on any third party or OS-specific headers/libraries (ie, everything needed to compile it must be readily available with a base installation of gcc, mingw, M$VC
  • it would be nice if it used Perl-compatible regex syntax (like PCRE in PHP).
  • ideally, the code should be as compact as possible

Are there any ready-made solutions that you could recommend? I was looking at PCRE for C and it looks like it has everything that's available in PHP (which rules), but the size (1.4MB DL) is a bit intimidating. Do you think it's a solid bet? Or are there other options worth considering?

[EDIT]

The library I'm developing is open source, BSD licence.

like image 706
mingos Avatar asked Dec 10 '10 14:12

mingos


People also ask

Is regex CPU intensive?

regex is expensive – regex is often the most CPU-intensive part of a program. And a non-matching regex can be even more expensive to check than a matching one.

Can you do regex in C?

A regular expression is a sequence of characters used to match a pattern to a string. The expression can be used for searching text and validating input. Remember, a regular expression is not the property of a particular language. POSIX is a well-known library used for regular expressions in C.

Does regex affect performance?

Being more specific with your regular expressions, even if they become much longer, can make a world of difference in performance. The fewer characters you scan to determine the match, the faster your regexes will be.


2 Answers

PCRE is so big because regular expressions are hard. And most of it is documentation and support code anyways; it's much smaller when compiled into object code.

like image 92
Ignacio Vazquez-Abrams Avatar answered Sep 17 '22 10:09

Ignacio Vazquez-Abrams


RE2, the Google regexp implementation does a match in linear time (O(n) if n is the length of the string), PCRE and most other regexp engines run in exponential time at worst case. Another noteworthy O(n) regexp matcher is flex, but it needs all possible regexps at compile time. If you are looking for something smaller than PCRE, look at the regexp matcher in busybox, or the pattern matcher in lua.

like image 45
pts Avatar answered Sep 19 '22 10:09

pts