
Combining multiple regex substitutions

Tags: python, regex

I'm trying to delete some things from a block of text using regex. I have all of my patterns ready, but I can't seem to remove two (or more) matches that overlap.

For example:

import re

r1 = r'I am'
r2 = r'am foo'

text = 'I am foo'

re.sub(r1, '', text)   # Returns ' foo'
re.sub(r2, '', text)   # Returns 'I '

How do I replace both of the occurrences simultaneously and end up with an empty string?


I ended up using a slightly modified version of Ned Batchelder's answer:

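def clean(self, text):
  # One zero byte per character; a non-zero byte marks a character to delete.
  mask = bytearray(len(text))

  # `patterns` is assumed to be defined elsewhere (not shown here).
  for pattern in patterns:
    for match in re.finditer(pattern, text):
      start, end = match.span()

      # A bytearray needs a slice (not a range) as index and bytes as value.
      mask[start:end] = b'x' * (end - start)

  return ''.join(character for character, bit in zip(text, mask) if not bit)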
Blender asked Jul 11 '12 22:07


4 Answers

You can't do it with consecutive re.sub calls as you have shown, because each call removes text that the next pattern needed in order to match. Instead, use re.finditer to find every match of every pattern in the original text. Each match is a match object whose .start() and .end() methods give its position. Collect all of those positions first, and only then remove the matched characters in a single pass.

Here I use a bytearray as a mutable string, used as a mask. It's initialized to zero bytes, and I mark with an 'x' all the bytes that match any regex. Then I use the bit mask to select the characters to keep in the original string, and build a new string with only the unmatched characters:

bits = bytearray(len(text))
for pat in patterns:
    for m in re.finditer(pat, text):
        # b'x' (bytes) is required on Python 3 and also works on Python 2.6+
        bits[m.start():m.end()] = b'x' * (m.end()-m.start())
new_string = ''.join(c for c,bit in zip(text, bits) if not bit)
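The snippet above assumes that text and patterns already exist. As a minimal self-contained sketch of the same mask idea, it could be wrapped in a function; the name remove_all is made up for this illustration and is not part of the original answer:

```python
import re

def remove_all(text, patterns):
    """Delete every character covered by a match of any pattern.

    `remove_all` is a hypothetical name used only for this sketch.
    """
    mask = bytearray(len(text))  # all zeros: keep every character
    for pattern in patterns:
        for m in re.finditer(pattern, text):
            # Mark the matched span; overlapping spans simply overwrite each other.
            mask[m.start():m.end()] = b"x" * (m.end() - m.start())
    # Keep only the characters whose mask byte is still zero.
    return "".join(ch for ch, hit in zip(text, mask) if not hit)

print(repr(remove_all("I am foo", [r"I am", r"am foo"])))
```

Because every pattern is matched against the untouched original text, the order of the patterns does not matter.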
Ned Batchelder answered Nov 17 '22 18:11


Not to be a downer, but the short answer is that I'm pretty sure you can't. Can you change your regex so that it doesn't require overlapping?

If you still want to do this, I would keep track of the start and stop indices of each match made on the original string. Then go through the string and keep only the characters that are not in any deletion range.
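One way to read that suggestion, as a rough sketch rather than the answerer's own code: collect the match spans, merge the ones that overlap, and copy everything in between. The helper names deletion_ranges and remove_ranges are invented for this example.

```python
import re

def deletion_ranges(text, patterns):
    """Return sorted (start, end) spans of all matches, merged where they overlap."""
    spans = sorted(m.span() for p in patterns for m in re.finditer(p, text))
    merged = []
    for start, end in spans:
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the previous span: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def remove_ranges(text, ranges):
    """Keep only the characters that fall outside every deletion range."""
    pieces, pos = [], 0
    for start, end in ranges:
        pieces.append(text[pos:start])
        pos = end
    pieces.append(text[pos:])
    return "".join(pieces)

ranges = deletion_ranges("I am foo", [r"I am", r"am foo"])
print(ranges, repr(remove_ranges("I am foo", ranges)))
```

Compared with a per-character mask, this stores one tuple per match instead of one byte per character, which may matter for long texts with few matches.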

Carl Walsh answered Nov 17 '22 18:11


Another quite efficient solution comes from Perl: combine the regexps into one.

# aptitude install regexp-assemble
$ regexp-assemble 
I am
I am foo
Ctrl + D
I am(?: foo)?

regexp-assemble takes all the variants of the regexps or strings you want to match and combines them into one. And yes, this changes the initial problem into a different one: it is no longer about matching overlapping regexps, but about combining regexps into a single match.

Then you can use the combined pattern in your code:

$ python
Python 2.7.3 (default, Aug  1 2012, 05:14:39) 
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> re.sub("I am(?: foo)?", "", "I am foo")
''
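To illustrate why the patterns have to be merged rather than just joined with |, here is a small sketch. It uses the two overlapping patterns from the question for the plain alternation, and the assembled pattern shown above for comparison; the variable names are only for this example.

```python
import re

text = "I am foo"

# Plain alternation: the engine consumes "I am" first, after which
# "am foo" can no longer match, so " foo" is left behind.
naive = re.sub(r"I am|am foo", "", text)

# Assembled pattern from above: one match covers the whole string.
assembled = re.sub(r"I am(?: foo)?", "", text)

print(repr(naive), repr(assembled))
```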

A port of Regexp::Assemble in python would be nice :)

user1458574 answered Nov 17 '22 18:11


Here is an alternative that filters the string on the fly, using itertools.compress on the text with a selector iterator. The selector yields True for each character that should be kept. selector_for_pattern creates one selector per pattern. The selectors are combined with the all function: a character ends up in the resulting string only when every pattern wants to keep it.

import itertools
import re

def selector_for_pattern(text, pattern):
    # Yields one flag per character: True = keep, False = matched by pattern.
    i = 0
    for m in re.finditer(pattern, text):
        for _ in range(i, m.start()):
            yield True
        for _ in range(m.start(), m.end()):
            yield False
        i = m.end()
    for _ in range(i, len(text)):
        yield True

def clean(text, patterns):
    gen = [selector_for_pattern(text, pattern) for pattern in patterns]
    # range, map and zip replace xrange, imap and izip, so this runs on
    # Python 3 as well as Python 2.
    selector = map(all, zip(*gen))
    return "".join(itertools.compress(text, selector))
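To see the combining step in isolation, here is a tiny sketch with hand-written selectors instead of regex matches. The flag lists keep_a and keep_b are made-up data for illustration only.

```python
import itertools

text = "abcdef"
# Hypothetical per-pattern selectors: True means "this pattern keeps the character".
keep_a = [True, False, False, True, True, True]
keep_b = [True, True, False, False, True, True]

# A character survives only if every selector says True at its position.
selector = [all(flags) for flags in zip(keep_a, keep_b)]
result = "".join(itertools.compress(text, selector))

print(result)
```

Only "a", "e" and "f" are kept by both selectors, so those are the characters that remain.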
Thomas Jung answered Nov 17 '22 18:11