I'm trying to delete some things from a block of text using regex. I have all of my patterns ready, but I can't seem to remove two (or more) matches that overlap.
For example:
import re
r1 = r'I am'
r2 = r'am foo'
text = 'I am foo'
re.sub(r1, '', text) # Returns ' foo'
re.sub(r2, '', text) # Returns 'I '
How do I replace both of the occurrences simultaneously and end up with an empty string?
I ended up using a slightly modified version of Ned Batchelder's answer:
def clean(self, text):
    mask = bytearray(len(text))
    for pattern in patterns:
        for match in re.finditer(pattern, text):
            # A bytearray cannot be indexed with a range, so assign to a slice instead.
            mask[match.start():match.end()] = b'x' * (match.end() - match.start())
    return ''.join(character for character, bit in zip(text, mask) if not bit)
You can't do it with consecutive re.sub calls as you have shown, because the first substitution removes text that the second pattern needs in order to match. You can use re.finditer to find all the matches instead. Each match gives you a match object, whose .start() and .end() methods return the position of the match. You can collect all those positions together, and then remove the characters in a single pass at the end.
Here I use a bytearray as a mutable string, used as a mask. It's initialized to zero bytes, and I mark with an 'x' all the positions that match any regex. Then I use the mask to select the characters to keep in the original string, and build a new string with only the unmatched characters:
bits = bytearray(len(text))
for pat in patterns:
    for m in re.finditer(pat, text):
        # b'x' works on both Python 2 and Python 3; a plain 'x' fails on Python 3.
        bits[m.start():m.end()] = b'x' * (m.end() - m.start())
new_string = ''.join(c for c, bit in zip(text, bits) if not bit)
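To check the mask idea against the example from the question, here is a minimal self-contained sketch. The function name remove_all is just a hypothetical wrapper I made up for illustration, not something from the re module:

```python
import re

def remove_all(text, patterns):
    # Hypothetical helper: mark every matched position, then drop marked characters.
    mask = bytearray(len(text))  # all zero bytes, so nothing is marked yet
    for pattern in patterns:
        for m in re.finditer(pattern, text):
            # Every pattern is matched against the ORIGINAL text, so overlaps are fine.
            mask[m.start():m.end()] = b'x' * (m.end() - m.start())
    return ''.join(c for c, marked in zip(text, mask) if not marked)

print(repr(remove_all('I am foo', [r'I am', r'am foo'])))  # ''
```

The key point is that no pattern ever sees the result of another pattern's deletion, which is what breaks the chained re.sub calls.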
Not to be a downer, but the short answer is that I'm pretty sure you can't. Can you change your regex so that it doesn't require overlapping?
If you still want to do this, I would try keeping track of the start and stop indices of each match made on the original string. Then go through the string and keep only the characters that are not in any deletion range.
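The index-tracking idea could look roughly like the sketch below. It is only one possible reading of the suggestion, and the name remove_spans is a hypothetical helper chosen for illustration:

```python
import re

def remove_spans(text, patterns):
    # Collect (start, end) for every match of every pattern on the original text.
    spans = sorted(m.span() for p in patterns for m in re.finditer(p, text))
    kept = []
    pos = 0  # first index that has been neither copied nor deleted yet
    for start, end in spans:
        if start > pos:
            kept.append(text[pos:start])  # gap between deletion ranges is kept
        pos = max(pos, end)  # overlapping ranges simply extend the deleted region
    kept.append(text[pos:])
    return ''.join(kept)

print(repr(remove_spans('I am foo', [r'I am', r'am foo'])))  # ''
```

Sorting the spans first means overlapping or nested ranges are handled by the single max() step, without building a per-character mask.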
Another quite efficient solution comes from Perl: combine the regexps into one.
# aptitude install regexp-assemble
$ regexp-assemble
I am
I am foo
Ctrl + D
I am(?: foo)?
regexp-assemble takes all the variants of the regexps or strings you want to match and combines them into one. And yes, it changes the initial problem into a different one, since it is no longer about matching overlapping regexps but about combining regexps into a single match.
Then you can use it in your code:
$ python
Python 2.7.3 (default, Aug 1 2012, 05:14:39)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> re.sub("I am(?: foo)?", "", "I am foo")
''
A port of Regexp::Assemble to Python would be nice :)
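In the meantime, for plain literal strings a rough stand-in is to join them into one alternation, longest first. To be clear, this is not a port of Regexp::Assemble, and combine_literals is a hypothetical name used only for this sketch:

```python
import re

def combine_literals(strings):
    # Longest first, so 'I am foo' is tried before its prefix 'I am'.
    ordered = sorted(set(strings), key=len, reverse=True)
    # re.escape treats every input as a literal string, not as a regex.
    return re.compile('|'.join(re.escape(s) for s in ordered))

combined = combine_literals(['I am', 'I am foo'])
print(repr(combined.sub('', 'I am foo')))  # ''
```

Note that this only helps when one string is a prefix of another. With the patterns from the question ('I am' and 'am foo') the alternation matches 'I am' first and leaves ' foo' behind, so a mask or span approach is still needed for true partial overlaps.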
Here is an alternative that filters the string on the fly using itertools.compress on the text with a selector iterator. The selector returns True if the character should be kept. selector_for_pattern creates one selector for each pattern. The selectors are combined with the all function (a character ends up in the resulting string only when every pattern wants to keep it).
import itertools
import re

def selector_for_pattern(text, pattern):
    # Yield True for characters to keep and False for characters inside a match.
    i = 0
    for m in re.finditer(pattern, text):
        for _ in range(i, m.start()):
            yield True
        for _ in range(m.start(), m.end()):
            yield False
        i = m.end()
    for _ in range(i, len(text)):
        yield True

def clean(text, patterns):
    gen = [selector_for_pattern(text, pattern) for pattern in patterns]
    # The built-in range, map and zip run on Python 2 and 3 alike
    # (the original used xrange, itertools.imap and itertools.izip, which are Python 2 only).
    selector = map(all, zip(*gen))
    return "".join(itertools.compress(text, selector))