
Why don't backreferences work in Python's re.sub when using a replacement function?

Using re.sub in Python 2.7, the following example uses a simple backreference:

re.sub('-{1,2}', r'\g<0> ', 'pro----gram-files')

It outputs the following string as expected:

'pro-- -- gram- files'

I would expect the following example to be identical, but it is not:

def dashrepl(matchobj):
    return r'\g<0> '
re.sub('-{1,2}', dashrepl, 'pro----gram-files')

This gives the following unexpected output:

'pro\\g<0> \\g<0> gram\\g<0> files'

Why do the two examples give different output? Did I miss something in the documentation that explains this? Is there any particular reason that this behavior is preferable to what I expected? Is there a way to use backreferences in a replacement function?

amcnabb asked Oct 18 '12 16:10


2 Answers

There is a simpler way to achieve your goal.

As you have already seen, your replacement function gets a match object as its argument.

This object has, among others, a method group() which can be used instead:

def dashrepl(matchobj):
    return matchobj.group(0) + ' '

which gives exactly the result you expected.
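As for using backreferences inside a replacement function: match objects also have an expand() method, which applies the same template processing that a string repl gets. Here is a minimal sketch comparing both approaches on the string from the question. The name dashrepl_expand is only illustrative, and the code should run unchanged on Python 2.7 and Python 3.

```python
import re

def dashrepl(matchobj):
    # group(0) is the whole matched text, the equivalent of \g<0>
    return matchobj.group(0) + ' '

def dashrepl_expand(matchobj):
    # expand() processes the template the way a string repl is processed
    return matchobj.expand(r'\g<0> ')

text = 'pro----gram-files'
via_group = re.sub('-{1,2}', dashrepl, text)
via_expand = re.sub('-{1,2}', dashrepl_expand, text)
```

Both via_group and via_expand come out as 'pro-- -- gram- files'. For a case this simple, group(0) is probably the more readable choice; expand() is mostly useful when you already have a template string.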


But you are completely right that the docs are a bit confusing on this point. They describe the repl argument as follows:

repl can be a string or a function; if it is a string, any backslash escapes in it are processed.

and

If repl is a function, it is called for every non-overlapping occurrence of pattern. The function takes a single match object argument, and returns the replacement string.

You could read this as saying that "the replacement string" returned by the function also undergoes backslash-escape processing.

But since that processing is described only for the case where repl "is a string", the intended meaning is that a function's return value is inserted literally, although this is not obvious at first glance.
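A small experiment makes the difference visible. This is only a sketch with a made-up input string ('axb'), using the escape \n rather than a group reference so that the two results are easy to tell apart:

```python
import re

# String repl: the template is processed, so r'\n' becomes a real newline.
from_string = re.sub('x', r'\n', 'axb')

# Function repl: the returned string is inserted as-is, backslash included.
from_function = re.sub('x', lambda m: r'\n', 'axb')
```

Here from_string contains an actual newline character between a and b, while from_function contains a backslash followed by the letter n.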

glglgl answered Sep 24 '22 07:09


If you pass a function to re.sub, each match is replaced with the string that the function returns. Basically, re.sub uses different code paths depending on whether you pass a function or a string. And yes, this is in fact desirable. Consider the case where you want to replace matches of foo with bar and matches of baz with qux. You can then write it as:

import re
repdict = {'foo': 'bar', 'baz': 'qux'}
# Look up each matched word in the dict and use the value as its replacement.
re.sub('foo|baz', lambda match: repdict[match.group(0)], 'foo')

You could argue that you could do this in two passes, but that breaks if repdict looks like {'foo':'baz','baz':'qux'}: the first pass turns foo into baz, and the second pass then turns that new baz into qux.

And I don't think you can do that with back-references (at least not easily).
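To illustrate the two-pass problem, here is a rough sketch using the second repdict. The input string 'foo baz' is just an example I picked, and the two-pass version uses str.replace for brevity:

```python
import re

repdict = {'foo': 'baz', 'baz': 'qux'}
text = 'foo baz'

# Single pass: each match is looked up once and the result is never rescanned.
single_pass = re.sub('foo|baz', lambda m: repdict[m.group(0)], text)

# Two passes: the baz produced by the first pass is replaced again by the second.
two_pass = text.replace('foo', 'baz').replace('baz', 'qux')
```

The single pass yields 'baz qux', which is what the mapping asks for, whereas the two-pass version ends up with 'qux qux'.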

mgilson answered Sep 21 '22 07:09