Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

regex.h performance

I'm trying to figure out counter-intuitive performance differences between Python, Cython and pure C with regex matching.

There is a small sample program that takes a source text file (17 KB), a dictionary of 2000 words, creates a regex with those words (word1|word2|...), and finds all instances of said dictionary in the source file.

First, I've done a pure Python implementation which looks like this:

def scanFile(filename, patterns):
   pattern_regex = re.compile('|'.join(patterns))
   pageContent = open(filename).read()
   matchingPatterns = set()
   for matchObj in pattern_regex.finditer(pageContent):
      matchingPatterns.add(matchObj.group(0))

   return matchingPatterns

Then, I've tried optimising this by reimplementing the same with Cython, on top of regex.h rather than Python's re module.

cdef extern from "regex.h" nogil:
   ctypedef struct regmatch_t:
      int rm_so
      int rm_eo
   ctypedef struct regex_t:
      pass
   int REG_EXTENDED
   int regcomp(regex_t* preg, const char* regex, int cflags)
   int regexec(const regex_t *preg, const char *string, size_t nmatch, regmatch_t pmatch[], int eflags)
   void regfree(regex_t* preg) 

def matchPatterns(bytes pageContent, bytes regex):
   cdef set matchingPatterns = set()
   cdef regex_t regex_obj
   cdef regmatch_t regmatch_obj[1]
   cdef int regex_res = 0
   cdef int current_str_pos = 0

   regcomp(&regex_obj, regex, REG_EXTENDED)
   regex_res = regexec(&regex_obj, pageContent[current_str_pos:], 1, regmatch_obj, 0)
   while regex_res == 0:
      matchingPatterns.add(pageContent[current_str_pos + regmatch_obj[0].rm_so: current_str_pos + regmatch_obj[0].rm_eo])
      current_str_pos += regmatch_obj[0].rm_eo
      regex_res = regexec(&regex_obj, pageContent[current_str_pos:], 1, regmatch_obj, 0)

   regfree(&regex_obj)
   return matchingPatterns

Performance, however, turned out to be precisely the other way around: Cython+regex.h takes about 2.34 sec and Python takes 0.92 sec.

After running a bit of profiling and custom commented out code, I confirmed the suspicion that it's down to regexec which takes 10s of milliseconds on each invocation.

Just to make sure it's not Cython that's at fault, prepared a standalone C unit test which uses the same inputs and regex.h, and it also showed worse results than Python (about 1.60 sec, i.e. 60% slower than Python).

So, with all of that, I'd be thankful for any insight into why regexec has such a poor performance.

I'm running this on Python 2.7.10, gcc 4.9.2, Cython 0.22, and the platform is Cygwin/Windows. I had a similar discrepancy when running the same on Ubuntu.

like image 617
RomanK Avatar asked Nov 09 '22 10:11

RomanK


1 Answers

Based on what is in the question I can assume several problems: -You are using POSIX on Windows, and Cygwin - this is overhead, Windows is not POSIX system. -There is comparison between pcre (let me assume pcre2) and regex.h -Standalone compiled code differs from exported functions (compiler cannot assume anything) -Standalone C program has big footprint tells you that recompilation of pattern or some other stuff is happening under the hood. -compilation options, and potential aliasing are always hard to compare.

Apart from source for standalone program, always using translators/transpilers may give lags. Optimisation is nowadays task to give your compiler clean idea what you are doing and let it work.

And sorry for this part, not related to question per se, but it looks like you do not need RE but basic string matching algorithm or some neat structure like prefix tree and simple loop to do your task.

like image 58
Evil Avatar answered Nov 14 '22 22:11

Evil