regex '|' operator vs separate runs for each sub-expression

Tags:

I've got a fairly large string (~700k) against which I need to run 10 regexes and count all the matches of any of the regexes. My quick and dirty impl was to do something like re.search('(expr1)|(expr2)|...'), but I was wondering if we'd see any performance gains by matching in a loop instead:

In other words, I want to compare the performance of:

def CountMatchesInBigstring(bigstring, my_regexes):
  """Counts how many of the expressions in my_regexes match bigstring."""
  count = 0
  combined_expr = '|'.join(['(%s)' % r for r in my_regexes])
  matches = re.search(combined_expr, bigstring)
  if matches:
    count += NumMatches(matches)
  return count

def CountMatchesInBigstring(bigstring, my_regexes):
  """Counts how many of the expressions in my_regexes match bigstring."""
  count = 0
  for reg in my_regexes:
    matches = re.search(reg, bigstring)
    if matches:
      count += NumMatches(matches)
  return count

I'll stop being lazy and run some tests tomorrow (and post the results), but I wondered whether the answer will jump out to someone who actually understands how regexes work :)

206

asked Feb 24 '09 08:02

Daniel H

2 Answers

The two things will give slightly different results, unless it is guaranteed that a match will match one and only one regex. Otherwise if something matches 2 it will be counted twice.

In theory your solution ought to be quicker (if the expression are mutually exclusive) because the regex compiler ought to be able to make a more efficient search state machine, so only one pass is needed. I would expect the difference to be tiny though, unless the expressions are very similar.

Also, if it were a huge string (bigger than 700k) there might be gains from doing one pass, and so a factor of n fewer memory swaps would be needed (to disk or cpu cache).

My bet is in your tests it isn't really noticeable though. I'm interested in the actual result - please do post the results.

134

answered Nov 03 '22 00:11

Nick Fortescue

To understand how re module works - compile _sre.c in debug mode (put #define VERBOSE at 103 line in _sre.c and recompile python). After this you ill see something like this:


>>> import re
>>> p = re.compile('(a)|(b)|(c)')
>>> p.search('a'); print '\n\n'; p.search('b')
|0xb7f9ab10|(nil)|SEARCH
prefix = (nil) 0 0
charset = (nil)
|0xb7f9ab1a|0xb7fb75f4|SEARCH
|0xb7f9ab1a|0xb7fb75f4|ENTER
allocating sre_match_context in 0 (32)
allocate/grow stack 1064
|0xb7f9ab1c|0xb7fb75f4|BRANCH
allocating sre_match_context in 32 (32)
|0xb7f9ab20|0xb7fb75f4|MARK 0
|0xb7f9ab24|0xb7fb75f4|LITERAL 97
|0xb7f9ab28|0xb7fb75f5|MARK 1
|0xb7f9ab2c|0xb7fb75f5|JUMP 20
|0xb7f9ab56|0xb7fb75f5|SUCCESS
discard data from 32 (32)
looking up sre_match_context at 0
|0xb7f9ab1c|0xb7fb75f4|JUMP_BRANCH
discard data from 0 (32)
|0xb7f9ab10|0xb7fb75f5|END




|0xb7f9ab10|(nil)|SEARCH
prefix = (nil) 0 0
charset = (nil)
|0xb7f9ab1a|0xb7fb7614|SEARCH
|0xb7f9ab1a|0xb7fb7614|ENTER
allocating sre_match_context in 0 (32)
allocate/grow stack 1064
|0xb7f9ab1c|0xb7fb7614|BRANCH
allocating sre_match_context in 32 (32)
|0xb7f9ab20|0xb7fb7614|MARK 0
|0xb7f9ab24|0xb7fb7614|LITERAL 97
discard data from 32 (32)
looking up sre_match_context at 0
|0xb7f9ab1c|0xb7fb7614|JUMP_BRANCH
allocating sre_match_context in 32 (32)
|0xb7f9ab32|0xb7fb7614|MARK 2
|0xb7f9ab36|0xb7fb7614|LITERAL 98
|0xb7f9ab3a|0xb7fb7615|MARK 3
|0xb7f9ab3e|0xb7fb7615|JUMP 11
|0xb7f9ab56|0xb7fb7615|SUCCESS
discard data from 32 (32)
looking up sre_match_context at 0
|0xb7f9ab2e|0xb7fb7614|JUMP_BRANCH
discard data from 0 (32)
|0xb7f9ab10|0xb7fb7615|END

>>>

answered Nov 03 '22 00:11

Mykola Kharechko

Related questions
                            
                                opencv python reading image as RGB
                            
                                Change the text color of cells in Plotly table based on value (string)
                            
                                Align Button To The center Of the window using pysimplegui
                            
                                Running poetry fails with /usr/bin/env: ‘python’: No such file or directory
                            
                                Custom authentication for FastAPI
                            
                                Pandas: Summing all elements in a dataframe? [duplicate]
                            
                                Easily check if table exists with python, sqlalchemy on an sql database
                            
                                Python Logging - AttributeError: module 'logging' has no attribute 'handlers'
                            
                                How to dump confusion matrix using TensorBoard logger in pytorch-lightning?
                            
                                Extract distinct values from Dataframe and insert them into new Dataframe with same column Name
                            
                                nbconvert use '--allow-chromium-download'
                            
                                How to pass each row of a dataFrame to an array
                            
                                Why am I getting this Index out of range error?
                            
                                how to get a subgroup start finish indexes of dataframe
                            
                                ImportError: Missing optional dependency 'xlrd'. Install xlrd >= 1.0.0 for Excel support Use pip or conda to install xlrd
                            
                                How do you create a random list of integers, but exclude a certain number?
                            
                                How to update dictionaries inside two separate lists?
                            
                                Open-source fractal maps [closed]
                            
                                Make python enter password when running a csh script
                            
                                Where should I post my python code?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

regex '|' operator vs separate runs for each sub-expression

Tags:

performance

python

regex

Daniel H

People also ask

2 Answers

Nick Fortescue

Mykola Kharechko

Recent Activity

Donate For Us