
Python re.sub: ignore backreferences in the replacement string

I want to replace a pattern with a string. The string is given in a variable. It might, of course, contain '\1', and it should not be interpreted as a backreference - but simply as \1.

How can I achieve that?

max asked Dec 09 '11 06:12


2 Answers

The previous answer using re.escape() would escape too much, and you would get undesirable backslashes in the replacement and the replaced string.

In Python, the backslash appears to be the only character that needs escaping in the replacement string, so something like this should be sufficient:

replacement = replacement.replace("\\", "\\\\")

Example:

import re

x = r'hai! \1 <ops> $1 \' \x \\'
print "want to see: "
print x

print "getting: "
print re.sub(".(.).", x, "###")
print "over escaped: "
print re.sub(".(.).", re.escape(x), "###")
print "could work: "
print re.sub(".(.).", x.replace("\\", "\\\\"), "###")

Output:

want to see: 
hai! \1 <ops> $1 \' \x \\
getting: 
hai! # <ops> $1 \' \x \
over escaped: 
hai\!\ \1\ \<ops\>\ \$1\ \\'\ \x\ \\
could work: 
hai! \1 <ops> $1 \' \x \\
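
The backslash-doubling approach can be wrapped in a small helper. The sketch below is only an illustration: the function names `sub_literal` and `sub_literal_callable` are hypothetical and not part of the `re` module. The second function shows a different technique that is not used in the answer above: when `re.sub` is given a callable as the replacement, it inserts the returned string as-is, without processing backslashes. The code uses `print()` with a single argument so that it also runs on Python 3.

```python
import re

def sub_literal(pattern, replacement, text):
    """Replace matches of `pattern` in `text` with `replacement`, taken literally.

    Hypothetical helper: every backslash is doubled, so re.sub's template
    processing turns each pair back into one literal backslash.
    """
    return re.sub(pattern, replacement.replace("\\", "\\\\"), text)

def sub_literal_callable(pattern, replacement, text):
    """Same goal via a callable: its return value is not parsed as a template."""
    return re.sub(pattern, lambda match: replacement, text)

replacement = r'hai! \1 <ops> $1 \' \x \\'
result_escaped = sub_literal(".(.).", replacement, "###")
result_callable = sub_literal_callable(".(.).", replacement, "###")

print(result_escaped == replacement)   # True
print(result_callable == replacement)  # True
```

Which variant to prefer is a matter of taste; the callable form avoids touching the replacement string at all.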
Qtax answered Nov 15 '22 05:11


Prompted by the comments, I thought about this for quite a while and tried it out. It helped me a lot to improve my understanding of escaping, so I rewrote my answer almost completely in the hope that it is useful for later readers.

NullUserException gave you just the short version; I will try to explain it a bit more. Thanks to the critical reviews by Qtax and Duncan, this answer is hopefully now correct and helpful.

The backslash has a special meaning: it is the escape character in strings. The backslash and the following character form an escape sequence that is translated to something else when something is done with the string. This "something" already happens when the string is created. So if you want to use \ literally, you need to escape it, and the escape character is the backslash itself.

To start, here are some examples to show what happens. I additionally print the ASCII codes of the characters in the string to make it clearer.

s = "A\1\nB"
print s
print [x for x in s]
print [hex(ord(x)) for x in s]

prints

A
B
['A', '\x01', '\n', 'B']
['0x41', '0x1', '0xa', '0x42']

So although I typed \ and 1 in the code, s does not contain those two characters; it contains the ASCII character 0x01, which is "Start of Heading". The same goes for \n: it is translated to 0x0a, the linefeed character.

Since this behaviour is not always wanted, raw strings can be used, in which escape sequences are not translated.

s = r"A\1\nB"
print s
print [x for x in s]
print [hex(ord(x)) for x in s]

I just added the r before the string and the result is now

A\1\nB
['A', '\\', '1', '\\', 'n', 'B']
['0x41', '0x5c', '0x31', '0x5c', '0x6e', '0x42']

All characters are printed as I typed them.
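
As a small additional check (a sketch with made-up variable names), the three ways of writing the literal can be compared directly. The raw string and the normal string with doubled backslashes end up identical, while the plain literal is two characters shorter:

```python
normal = "A\1\nB"      # escape sequences are translated when the literal is parsed
raw = r"A\1\nB"        # raw literal: backslashes are kept as typed
escaped = "A\\1\\nB"   # normal literal with every backslash doubled

print(len(normal))     # 4
print(len(raw))        # 6
print(raw == escaped)  # True
```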

That is the starting situation. Now for the next step.

A string may need to be passed to a regex to be matched literally, so every character that has a special meaning within a regex (e.g. +*$[.) needs to be escaped. There is a special function, re.escape, that does this job.

But for this question it is the wrong function, because the string is not used within a regex but as the replacement string for re.sub.
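
To make the distinction concrete, here is a minimal sketch of what re.escape is meant for, namely the pattern side. The strings `needle` and `haystack` are invented for illustration:

```python
import re

needle = "1+1=2"              # contains '+', which is a regex quantifier
haystack = "so 1+1=2 holds"

# Unescaped, '1+' means "one or more 1", so the text "1+1=2" is not matched.
unescaped_match = re.search(needle, haystack)

# Escaped, every character of the needle is matched literally.
escaped_match = re.search(re.escape(needle), haystack)

print(unescaped_match)         # None
print(escaped_match.group(0))  # 1+1=2
```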

So, the new situation:

A raw string containing escape sequences is to be used as the replacement string for re.sub. re.sub also processes the escape sequences, but with a small but important difference from before: \n is still translated to 0x0a, the linefeed character, but the translation of \1 has changed. It is now replaced by the content of capturing group 1 of the regex in re.sub.

import re

s = r"A\1\nB"
print re.sub(r"(Replace)", s, "1 Replace 2")

And the result is

1 AReplace
B 2

The \1 has been replaced with the content of the capturing group and \n with the linefeed character.

The important point is that you have to understand this behaviour. You then have two possibilities, in my opinion (and I am not going to judge which one is the correct one):

  1. The creator of the string is unsure about its behaviour, and if he types \n he wants a newline. In this case, use this to escape only the \ characters that are followed by a digit.

    OnlyDigits = re.sub(r"(Replace)", re.sub(r"(\\)(?=\d)", r"\\\\", s), "1 Replace 2")
    print OnlyDigits
    print [x for x in OnlyDigits]
    print [hex(ord(x)) for x in OnlyDigits]
    

    Output:

    1 A\1
    B 2
    ['1', ' ', 'A', '\\', '1', '\n', 'B', ' ', '2']
    ['0x31', '0x20', '0x41', '0x5c', '0x31', '0xa', '0x42', '0x20', '0x32']
    
  2. The creator knows exactly what he is doing, and if he had wanted a newline, he would have put an actual linefeed character (0x0a) into the string. In this case, escape all backslashes.

    All = re.sub(r"(Replace)", re.sub(r"(\\)", r"\\\\", s), "1 Replace 2")
    print All
    print [x for x in All]
    print [hex(ord(x)) for x in All]
    

    Output:

    1 A\1\nB 2
    ['1', ' ', 'A', '\\', '1', '\\', 'n', 'B', ' ', '2']
    ['0x31', '0x20', '0x41', '0x5c', '0x31', '0x5c', '0x6e', '0x42', '0x20', '0x32']
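
The two options can also be packaged as small functions. This is only a sketch: the names `escape_group_refs` and `escape_all_backslashes` are hypothetical, and `print()` is called with a single argument so that the code also runs on Python 3.

```python
import re

def escape_group_refs(replacement):
    """Option 1: double only the backslashes directly followed by a digit."""
    return re.sub(r"\\(?=\d)", r"\\\\", replacement)

def escape_all_backslashes(replacement):
    """Option 2: double every backslash so that nothing is interpreted."""
    return replacement.replace("\\", "\\\\")

s = r"A\1\nB"
only_digits = re.sub(r"(Replace)", escape_group_refs(s), "1 Replace 2")
everything = re.sub(r"(Replace)", escape_all_backslashes(s), "1 Replace 2")

print(only_digits == "1 A\\1\nB 2")   # True: \1 kept, \n became a linefeed
print(everything == r"1 A\1\nB 2")    # True: both \1 and \n kept as typed
```

One limitation of the first function: it only looks for a backslash followed by a digit, so references written in the \g<1> form would still be interpreted by re.sub.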
    
stema answered Nov 15 '22 06:11