I'm trying to figure out counter-intuitive performance differences between Python, Cython, and pure C with regex matching.
I have a small sample program that takes a source text file (17 KB) and a dictionary of 2000 words, builds a regex from those words (word1|word2|...), and finds all occurrences of the dictionary words in the source file.
First, I've done a pure Python implementation which looks like this:
import re

def scanFile(filename, patterns):
    pattern_regex = re.compile('|'.join(patterns))
    pageContent = open(filename).read()
    matchingPatterns = set()
    for matchObj in pattern_regex.finditer(pageContent):
        matchingPatterns.add(matchObj.group(0))
    return matchingPatterns
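For illustration, the behaviour of that function can be reproduced on an in-memory string. The word list and text below are made up; `re.escape` is an addition so that dictionary words cannot be misread as regex metacharacters:

```python
import re

def scan_text(text, patterns):
    # Build one alternation regex from all dictionary words and
    # collect the distinct words that actually occur in the text.
    pattern_regex = re.compile('|'.join(re.escape(p) for p in patterns))
    return {m.group(0) for m in pattern_regex.finditer(text)}

words = ['cat', 'dog', 'bird']
text = 'the cat chased the dog, and the dog barked'
print(sorted(scan_text(text, words)))  # -> ['cat', 'dog']
```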
Then, I tried optimising this by reimplementing the same logic in Cython, on top of regex.h rather than Python's re module.
cdef extern from "regex.h" nogil:
    ctypedef struct regmatch_t:
        int rm_so
        int rm_eo
    ctypedef struct regex_t:
        pass
    int REG_EXTENDED
    int regcomp(regex_t* preg, const char* regex, int cflags)
    int regexec(const regex_t *preg, const char *string, size_t nmatch, regmatch_t pmatch[], int eflags)
    void regfree(regex_t* preg)

def matchPatterns(bytes pageContent, bytes regex):
    cdef set matchingPatterns = set()
    cdef regex_t regex_obj
    cdef regmatch_t regmatch_obj[1]
    cdef int regex_res = 0
    cdef int current_str_pos = 0

    regcomp(&regex_obj, regex, REG_EXTENDED)
    regex_res = regexec(&regex_obj, pageContent[current_str_pos:], 1, regmatch_obj, 0)
    while regex_res == 0:
        matchingPatterns.add(pageContent[current_str_pos + regmatch_obj[0].rm_so : current_str_pos + regmatch_obj[0].rm_eo])
        current_str_pos += regmatch_obj[0].rm_eo
        regex_res = regexec(&regex_obj, pageContent[current_str_pos:], 1, regmatch_obj, 0)
    regfree(&regex_obj)
    return matchingPatterns
Performance, however, turned out to be precisely the other way around: Cython + regex.h takes about 2.34 s, while Python takes 0.92 s.
After a bit of profiling and selectively commenting out code, I confirmed the suspicion that the cost is in regexec, which takes tens of milliseconds on each invocation.
Just to make sure it's not Cython that's at fault, I prepared a standalone C unit test which uses the same inputs and regex.h, and it also showed worse results than Python (about 1.60 s, i.e. 60% slower than Python).
So, with all of that, I'd be thankful for any insight into why regexec performs so poorly.
I'm running this on Python 2.7.10, gcc 4.9.2, and Cython 0.22, on Cygwin/Windows. I saw a similar discrepancy when running the same code on Ubuntu.
Based on what is in the question, I can point to several likely problems:
- You are using POSIX APIs on Windows, via Cygwin. That adds overhead; Windows is not a POSIX system.
- You are comparing two different regex engines: Python's built-in re (a PCRE-style backtracking engine) and POSIX regex.h. They can perform very differently on the same pattern.
- Standalone compiled code differs from exported functions (across a function boundary the compiler cannot assume anything).
- The big footprint of the standalone C program suggests that recompilation of the pattern, or some other work, is happening under the hood.
- Compilation options and potential aliasing always make such comparisons hard.
Standalone programs aside, going through translators/transpilers can always introduce overhead. Optimisation nowadays is largely the task of giving your compiler a clean picture of what you are doing and letting it work.
And sorry that this part is not related to the question per se, but it looks like you do not need a regex at all: a basic string-matching algorithm, or a neat structure like a prefix tree with a simple loop, would do your task.
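A minimal sketch of that idea, assuming whole-word matches are what's wanted (the tokenisation rule here is a simplifying assumption):

```python
def scan_text_simple(text, words):
    # Split the text into alphabetic tokens and intersect with the
    # dictionary -- one linear pass, no regex engine involved.
    dictionary = set(words)
    tokens = ''.join(c if c.isalpha() else ' ' for c in text).split()
    return dictionary & set(tokens)

words = ['cat', 'dog', 'bird']
text = 'the cat chased the dog; the dog barked'
print(sorted(scan_text_simple(text, words)))  # -> ['cat', 'dog']
```

Note one behavioural difference: unlike the alternation regex, this only finds whole tokens. If substring matches matter, a multi-pattern algorithm such as Aho-Corasick is the usual choice.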