Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Capture the contents of a regex and delete them, efficiently

Tags:

python

regex

Situation:

  • text: a string
  • R: a regex that matches part of the string. This might be expensive to calculate.

I want to both delete the R-matches from the text, and see what they actually contain. Currently, I do this like:

import re
ab_re = re.compile("[ab]")
text="abcdedfe falijbijie bbbb laifsjelifjl"
ab_re.findall(text)
# ['a', 'b', 'a', 'b', 'b', 'b', 'b', 'b', 'a']
ab_re.sub('',text)
# 'cdedfe flijijie  lifsjelifjl'

This runs the regex twice, near as I can tell. Is there a technique to do it all on pass, perhaps using re.split? It seems like with split based solutions I'd need to do the regex at least twice as well.

like image 846
Gregg Lind Avatar asked Oct 15 '08 14:10

Gregg Lind


3 Answers

import re

r = re.compile("[ab]")
text = "abcdedfe falijbijie bbbb laifsjelifjl"

matches = []
replaced = []
pos = 0
for m in r.finditer(text):
    matches.append(m.group(0))
    replaced.append(text[pos:m.start()])
    pos = m.end()
replaced.append(text[pos:])

print matches
print ''.join(replaced)

Outputs:

['a', 'b', 'a', 'b', 'b', 'b', 'b', 'b', 'a']
cdedfe flijijie  lifsjelifjl
like image 57
Deestan Avatar answered Oct 06 '22 11:10

Deestan


What about this:

import re

text = "abcdedfe falijbijie bbbb laifsjelifjl"
matches = []

ab_re = re.compile( "[ab]" )

def verboseTest( m ):
    matches.append( m.group(0) )
    return ''

textWithoutMatches = ab_re.sub( verboseTest, text )

print matches
# ['a', 'b', 'a', 'b', 'b', 'b', 'b', 'b', 'a']
print textWithoutMatches
# cdedfe flijijie  lifsjelifjl

The 'repl' argument of the re.sub function can be a function so you can report or save the matches from there and whatever the function returns is what 'sub' will substitute.

The function could easily be modified to do a lot more too! Check out the re module documentation on docs.python.org for more information on what else is possible.

like image 27
Jon Cage Avatar answered Oct 06 '22 11:10

Jon Cage


My revised answer, using re.split(), which does things in one regex pass:

import re
text="abcdedfe falijbijie bbbb laifsjelifjl"
ab_re = re.compile("([ab])")
tokens = ab_re.split(text)
non_matches = tokens[0::2]
matches = tokens[1::2]

(edit: here is a complete function version)

def split_matches(text,compiled_re):
    ''' given  a compiled re, split a text 
    into matching and nonmatching sections
    returns m, n_m, two lists
    '''
    tokens = compiled_re.split(text)
    matches = tokens[1::2]
    non_matches = tokens[0::2]
    return matches,non_matches

m,nm = split_matches(text,ab_re)
''.join(nm) # equivalent to ab_re.sub('',text)
like image 32
Gregg Lind Avatar answered Oct 06 '22 12:10

Gregg Lind