Combining multiple regex substitutions

Question

I'm trying to delete some things from a block of text using regex. I have all of my patterns ready, but I can't seem to remove two (or more) that overlap.

For example:

import re

r1 = r'I am'
r2 = r'am foo'

text = 'I am foo'

re.sub(r1, '', text)   # Returns ' foo'
re.sub(r2, '', text)   # Returns 'I '

How do I replace both of the occurrences simultaneously and end up with an empty string?

I ended up using a slightly modified version of Ned Batchelder's answer:

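def clean(self, text):
  # One zero byte per character; a non-zero byte marks a character to delete.
  mask = bytearray(len(text))

  # `patterns` is assumed to be defined elsewhere (it is not shown here).
  for pattern in patterns:
    for match in re.finditer(pattern, text):
      start, end = match.span()

      # Slice assignment: a bytearray cannot be indexed with a range object.
      mask[start:end] = b'x' * (end - start)

  return ''.join(character for character, bit in zip(text, mask) if not bit)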

Ned Batchelder · Accepted Answer

You can't do it with consecutive re.sub calls as you have shown, because each call changes the text that the next pattern sees. You can use re.finditer to find all the matches instead. Each match gives you a match object, whose .start() and .end() methods give its position. You can collect all of those positions together, and then remove the characters in a single pass at the end.

Here I use a bytearray as a mutable string, used as a mask. It's initialized to zero bytes, and I mark with an 'x' all the bytes that match any regex. Then I use the bit mask to select the characters to keep in the original string, and build a new string with only the unmatched characters:

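bits = bytearray(len(text))  # one zero byte per character of text
for pat in patterns:
    for m in re.finditer(pat, text):
        # b'x' rather than 'x' so the assignment also works on Python 3
        bits[m.start():m.end()] = b'x' * (m.end()-m.start())
# keep only the characters whose mask byte is still zero
new_string = ''.join(c for c,bit in zip(text, bits) if not bit)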
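For reference, here is a self-contained version of the snippet above, wrapped in a function and run against the example from the question. The function name remove_all is made up for illustration. It also shows why chaining re.sub calls does not work on this input.

```python
import re

def remove_all(text, patterns):
    """Delete every character covered by a match of any pattern."""
    # One zero byte per character of text; non-zero means "delete this one".
    bits = bytearray(len(text))
    for pat in patterns:
        for m in re.finditer(pat, text):
            # Mark the matched span; spans from different patterns may overlap.
            bits[m.start():m.end()] = b'x' * (m.end() - m.start())
    # Keep only the characters whose mask byte is still zero.
    return ''.join(c for c, bit in zip(text, bits) if not bit)

text = 'I am foo'
patterns = [r'I am', r'am foo']

# Chaining re.sub fails: once 'I am' is gone, 'am foo' no longer matches.
chained = re.sub(patterns[1], '', re.sub(patterns[0], '', text))
masked = remove_all(text, patterns)
print(repr(chained))  # ' foo'
print(repr(masked))   # ''
```

Every pattern is matched against the original text, so the order of the patterns does not matter.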

Carl Walsh · Answer

Not to be a downer, but the short answer is that I'm pretty sure you can't. Can you change your regex so that it doesn't require overlapping?

If you still want to do this, I would try keeping track of the start and stop indices of each match made on the original string. Then go through the string and keep only the characters that are not in any deletion range.
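A rough sketch of that idea, assuming the patterns are ordinary Python regexes: collect the (start, stop) pair of every match on the original string, sort them, and copy only the gaps between deletion ranges. The name remove_spans is hypothetical.

```python
import re

def remove_spans(text, patterns):
    """Drop every character that falls inside any match span."""
    # Collect (start, end) for every match of every pattern on the ORIGINAL text.
    spans = sorted(m.span() for pat in patterns for m in re.finditer(pat, text))
    kept = []
    pos = 0  # first index that has not been handled yet
    for start, end in spans:
        if start > pos:
            kept.append(text[pos:start])  # gap before this deletion range
        pos = max(pos, end)  # overlapping ranges just extend the deleted region
    kept.append(text[pos:])  # tail after the last deletion range
    return ''.join(kept)

print(repr(remove_spans('I am foo', [r'I am', r'am foo'])))  # ''
```

Because the ranges are sorted, overlapping or nested matches collapse into one deleted region without any extra bookkeeping.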

user1458574 · Answer

Another quite efficient solution comes from Perl: combine the regexps into one.

# aptitude install regexp-assemble
$ regexp-assemble 
I am
I am foo
Ctrl + D
I am(?: foo)?

regexp-assemble takes all the variants of the regexps or strings you want to match and combines them into one. And yes, it changes the initial problem into another one, since it is no longer about matching overlapping regexps but about combining regexps into a single match.

And then you can use it in your code:

$ python
Python 2.7.3 (default, Aug  1 2012, 05:14:39) 
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> re.sub("I am(?: foo)?", "", "I am foo")
''

A port of Regexp::Assemble in python would be nice :)
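Short of a real port, a rough approximation for plain literal strings (not arbitrary regexes) is to escape each string and join them with |, longest first, so that a longer variant is tried before its own prefix. The helper name combine_literals is hypothetical, and this sketch does not factor common prefixes into something like the I am(?: foo)? shown above.

```python
import re

def combine_literals(strings):
    """Build one compiled pattern matching any of the given literal strings."""
    # Longest first: Python's alternation takes the first branch that matches,
    # so 'I am foo' must come before its prefix 'I am'.
    ordered = sorted(set(strings), key=len, reverse=True)
    return re.compile('|'.join(re.escape(s) for s in ordered))

combined = combine_literals(['I am', 'I am foo'])
print(repr(combined.sub('', 'I am foo')))  # ''
print(repr(combined.sub('', 'I am bar')))  # ' bar'
```

Like regexp-assemble, this only helps when one variant contains the other from the start. It does not handle patterns that overlap partially, such as 'I am' and 'am foo' from the question.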

Thomas Jung · Answer

Here is an alternative that filters the string on the fly using itertools.compress on the text with a selector iterator. The selector yields True if the character should be kept. selector_for_pattern creates one selector for each pattern. The selectors are combined with the all function (a character ends up in the resulting string only when every pattern wants to keep it).

import itertools
import re

def selector_for_pattern(text, pattern):
    i = 0
    for m in re.finditer(pattern, text):
        for _ in xrange(i, m.start()):
            yield True
        for _ in xrange(m.start(), m.end()):
            yield False
        i = m.end()
    for _ in xrange(i, len(text)):
        yield True

def clean(text, patterns):
    gen = [selector_for_pattern(text, pattern) for pattern in patterns]
    selector = itertools.imap(all, itertools.izip(* gen))
    return "".join(itertools.compress(text, selector))
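The code above targets Python 2 (xrange, itertools.imap, itertools.izip). A possible Python 3 port, assuming nothing else changes, replaces those with the built-in range, map and zip, which are already lazy:

```python
import itertools
import re

def selector_for_pattern(text, pattern):
    """Yield one bool per character: False inside a match, True elsewhere."""
    i = 0
    for m in re.finditer(pattern, text):
        for _ in range(i, m.start()):
            yield True
        for _ in range(m.start(), m.end()):
            yield False
        i = m.end()
    for _ in range(i, len(text)):
        yield True

def clean(text, patterns):
    selectors = [selector_for_pattern(text, pattern) for pattern in patterns]
    # Keep a character only if every per-pattern selector yields True for it.
    keep = map(all, zip(*selectors))
    return "".join(itertools.compress(text, keep))

print(repr(clean('I am foo', [r'I am', r'am foo'])))  # ''
```

One thing to keep in mind: with an empty pattern list, zip() yields nothing, so clean returns an empty string rather than the original text. The Python 2 version behaves the same way.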

Tags: python, regex

Blender

