Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

regex '|' operator vs separate runs for each sub-expression

I've got a fairly large string (~700k) against which I need to run 10 regexes and count all the matches of any of the regexes. My quick and dirty impl was to do something like re.search('(expr1)|(expr2)|...'), but I was wondering if we'd see any performance gains by matching in a loop instead:

In other words, I want to compare the performance of:

def CountMatchesInBigstring(bigstring, my_regexes):
  """Counts how many of the expressions in my_regexes match bigstring."""
  count = 0
  combined_expr = '|'.join(['(%s)' % r for r in my_regexes])
  matches = re.search(combined_expr, bigstring)
  if matches:
    count += NumMatches(matches)
  return count

vs

def CountMatchesInBigstring(bigstring, my_regexes):
  """Counts how many of the expressions in my_regexes match bigstring."""
  count = 0
  for reg in my_regexes:
    matches = re.search(reg, bigstring)
    if matches:
      count += NumMatches(matches)
  return count

I'll stop being lazy and run some tests tomorrow (and post the results), but I wondered whether the answer will jump out to someone who actually understands how regexes work :)

like image 206
Daniel H Avatar asked Feb 24 '09 08:02

Daniel H


People also ask

What is the difference between * and *??

*? is non-greedy. * will match nothing, but then will try to match extra characters until it matches 1 , eventually matching 101 . All quantifiers have a non-greedy mode: . *? , .

What does ?= Mean in regular expression?

?= is a positive lookahead, a type of zero-width assertion. What it's saying is that the captured match must be followed by whatever is within the parentheses but that part isn't captured. Your example means the match needs to be followed by zero or more characters and then a digit (but again that part isn't captured).

How can you tell a regex that you want it to fit real parentheses and periods?

How would you specify that you want a regex to match actual parentheses and period characters? Periods and parentheses can be escaped with a backslash: \., \(, and \).

What is the point of re compile?

The re. compile(pattern, repl, string): We can combine a regular expression pattern into pattern objects, which can be used for pattern matching. It also helps to search a pattern again without rewriting it.


2 Answers

The two things will give slightly different results, unless it is guaranteed that a match will match one and only one regex. Otherwise if something matches 2 it will be counted twice.

In theory your solution ought to be quicker (if the expression are mutually exclusive) because the regex compiler ought to be able to make a more efficient search state machine, so only one pass is needed. I would expect the difference to be tiny though, unless the expressions are very similar.

Also, if it were a huge string (bigger than 700k) there might be gains from doing one pass, and so a factor of n fewer memory swaps would be needed (to disk or cpu cache).

My bet is in your tests it isn't really noticeable though. I'm interested in the actual result - please do post the results.

like image 134
Nick Fortescue Avatar answered Nov 03 '22 00:11

Nick Fortescue


To understand how re module works - compile _sre.c in debug mode (put #define VERBOSE at 103 line in _sre.c and recompile python). After this you ill see something like this:


>>> import re
>>> p = re.compile('(a)|(b)|(c)')
>>> p.search('a'); print '\n\n'; p.search('b')
|0xb7f9ab10|(nil)|SEARCH
prefix = (nil) 0 0
charset = (nil)
|0xb7f9ab1a|0xb7fb75f4|SEARCH
|0xb7f9ab1a|0xb7fb75f4|ENTER
allocating sre_match_context in 0 (32)
allocate/grow stack 1064
|0xb7f9ab1c|0xb7fb75f4|BRANCH
allocating sre_match_context in 32 (32)
|0xb7f9ab20|0xb7fb75f4|MARK 0
|0xb7f9ab24|0xb7fb75f4|LITERAL 97
|0xb7f9ab28|0xb7fb75f5|MARK 1
|0xb7f9ab2c|0xb7fb75f5|JUMP 20
|0xb7f9ab56|0xb7fb75f5|SUCCESS
discard data from 32 (32)
looking up sre_match_context at 0
|0xb7f9ab1c|0xb7fb75f4|JUMP_BRANCH
discard data from 0 (32)
|0xb7f9ab10|0xb7fb75f5|END




|0xb7f9ab10|(nil)|SEARCH
prefix = (nil) 0 0
charset = (nil)
|0xb7f9ab1a|0xb7fb7614|SEARCH
|0xb7f9ab1a|0xb7fb7614|ENTER
allocating sre_match_context in 0 (32)
allocate/grow stack 1064
|0xb7f9ab1c|0xb7fb7614|BRANCH
allocating sre_match_context in 32 (32)
|0xb7f9ab20|0xb7fb7614|MARK 0
|0xb7f9ab24|0xb7fb7614|LITERAL 97
discard data from 32 (32)
looking up sre_match_context at 0
|0xb7f9ab1c|0xb7fb7614|JUMP_BRANCH
allocating sre_match_context in 32 (32)
|0xb7f9ab32|0xb7fb7614|MARK 2
|0xb7f9ab36|0xb7fb7614|LITERAL 98
|0xb7f9ab3a|0xb7fb7615|MARK 3
|0xb7f9ab3e|0xb7fb7615|JUMP 11
|0xb7f9ab56|0xb7fb7615|SUCCESS
discard data from 32 (32)
looking up sre_match_context at 0
|0xb7f9ab2e|0xb7fb7614|JUMP_BRANCH
discard data from 0 (32)
|0xb7f9ab10|0xb7fb7615|END

>>>                                      

like image 29
Mykola Kharechko Avatar answered Nov 03 '22 00:11

Mykola Kharechko