Capture the contents of a regex and delete them, efficiently

Question

Situation:

text: a string
R: a regex that matches part of the string. This might be expensive to calculate.

I want to both delete the R-matches from the text and see what they actually contain. Currently, I do it like this:

import re
ab_re = re.compile("[ab]")
text="abcdedfe falijbijie bbbb laifsjelifjl"
ab_re.findall(text)
# ['a', 'b', 'a', 'b', 'b', 'b', 'b', 'b', 'a']
ab_re.sub('',text)
# 'cdedfe flijijie  lifsjelifjl'

This runs the regex twice, as near as I can tell. Is there a technique to do it all in one pass, perhaps using re.split? It seems that with split-based solutions I'd need to run the regex at least twice as well.

Deestan · Accepted Answer

import re

r = re.compile("[ab]")
text = "abcdedfe falijbijie bbbb laifsjelifjl"

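matches = []   # text of each match, in order
replaced = []  # pieces of text between matches
pos = 0        # end of the previous match
for m in r.finditer(text):
    matches.append(m.group(0))
    replaced.append(text[pos:m.start()])
    pos = m.end()
replaced.append(text[pos:])  # tail after the last match

print(matches)
print(''.join(replaced))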

Outputs:

['a', 'b', 'a', 'b', 'b', 'b', 'b', 'b', 'a']
cdedfe flijijie  lifsjelifjl

Jon Cage · Answer

What about this:

import re

text = "abcdedfe falijbijie bbbb laifsjelifjl"
matches = []

ab_re = re.compile( "[ab]" )

def verboseTest( m ):
    matches.append( m.group(0) )
    return ''

textWithoutMatches = ab_re.sub( verboseTest, text )

print( matches )
# ['a', 'b', 'a', 'b', 'b', 'b', 'b', 'b', 'a']
print( textWithoutMatches )
# cdedfe flijijie  lifsjelifjl

The 'repl' argument of re.sub can be a function. It is called once per match with the match object, so you can report or save the matches from there, and whatever it returns is what 'sub' substitutes.

The function could easily be modified to do a lot more too! Check out the re module documentation on docs.python.org for more information on what else is possible.
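As one illustration of doing more in the callback, here is a minimal sketch that records each match's offset along with its text, and keeps the list local instead of global. The names remove_and_collect and collect are made up for this example; they are not part of the re module.

```python
import re

def remove_and_collect(compiled_re, text):
    """Hypothetical helper: delete matches from text and report them with offsets."""
    found = []

    def collect(m):
        # m.start() is the offset of this match in the original text
        found.append((m.start(), m.group(0)))
        return ''  # substitute nothing, i.e. delete the match

    stripped = compiled_re.sub(collect, text)
    return stripped, found

stripped, found = remove_and_collect(re.compile("[ab]"), "abc bad")
print(stripped)
print(found)
```

The regex still runs only once, since sub() calls collect as it goes.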

Gregg Lind · Answer

My revised answer, using re.split(), which does things in one regex pass:

import re
text="abcdedfe falijbijie bbbb laifsjelifjl"
ab_re = re.compile("([ab])")
tokens = ab_re.split(text)
non_matches = tokens[0::2]
matches = tokens[1::2]

(edit: here is a complete function version)

def split_matches(text, compiled_re):
    ''' Given a compiled re whose whole pattern is wrapped in one
    capturing group, split a text into matching and nonmatching sections.
    Returns two lists: matches, non_matches.
    '''
    tokens = compiled_re.split(text)
    matches = tokens[1::2]
    non_matches = tokens[0::2]
    return matches, non_matches

m,nm = split_matches(text,ab_re)
''.join(nm) # equivalent to ab_re.sub('',text)
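
One caveat with the split approach: the [0::2] / [1::2] slicing only works when the pattern has exactly one capturing group wrapping the whole expression, because re.split() puts the text of every capturing group into its result. A minimal sketch, assuming the original pattern has no capturing groups of its own (wrap_in_group is a hypothetical helper name):

```python
import re

def wrap_in_group(pattern):
    # Assumption: pattern contains no capturing groups of its own;
    # inner groups would add extra items to the split() result
    # and break the alternating match / non-match layout.
    return re.compile("(" + pattern + ")")

text = "abcdedfe falijbijie bbbb laifsjelifjl"
plain_re = re.compile("[ab]")

tokens = wrap_in_group("[ab]").split(text)
matches = tokens[1::2]             # odd positions: the captured matches
remainder = ''.join(tokens[0::2])  # even positions: text between matches

# Same results as the two-pass version from the question
assert matches == plain_re.findall(text)
assert remainder == plain_re.sub('', text)
```

If the pattern does need internal grouping, non-capturing groups, written (?:...), keep the layout intact.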

