
Python Regex escape operator \ in substitutions & raw strings

I don't understand the logic behind the escape operator \ in Python regex when it is combined with the r' prefix of raw strings. Any help is appreciated.

code:

import re
text=' esto  .es  10  . er - 12 .23 with [  and.Other ] here is more ; puntuation'
print('text0=',text)
text1 = re.sub(r'(\s+)([;:\.\-])', r'\2', text)
text2 = re.sub(r'\s+\.', '\.', text)
text3 = re.sub(r'\s+\.', r'\.', text)
print('text1=',text1)
print('text2=',text2)
print('text3=',text3)

The theory says that the backslash character ('\') is used to indicate special forms or to allow special characters to be used without invoking their special meaning.

And as far as the link provided at the end of this question explains, r' marks a raw string, i.e. symbols have no special meaning and the string stays exactly as written.

So in the code above I would expect text2 and text3 to be different: the substitution text for text2 is '\.', i.e. a period, whereas (in principle) the substitution text for text3 is r'\.', a raw string, so the string should appear as it is, backslash and period. But both give the same result:

The result is:

text0=  esto  .es  10  . er - 12 .23 with [  and.Other ] here is more ; puntuation
text1=  esto.es  10. er- 12.23 with [  and.Other ] here is more; puntuation
text2=  esto\.es  10\. er - 12\.23 with [  and.Other ] here is more ; puntuation
text3=  esto\.es  10\. er - 12\.23 with [  and.Other ] here is more ; puntuation
#text2=text3 but substitutions are not the same r'\.' vs '\.'

It looks to me as if neither r' nor the backslash works the same way in the substitution part. On the other hand, my intuition tells me I am missing something here.
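One way to check what is going on, as a minimal sketch: compare the two replacement literals directly, before re.sub ever sees them. The names plain and raw are illustrative only, and ast.literal_eval is used here just so that the invalid-escape warning newer Python versions emit for '\.' can be silenced.

```python
import ast
import warnings

# Build both literals from source text so the invalid-escape warning
# that newer Python versions emit for '\.' can be silenced.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    plain = ast.literal_eval(r"'\.'")   # the literal '\.'
    raw = ast.literal_eval(r"r'\.'")    # the literal r'\.'

print(plain == raw)  # True: '\.' is not a recognized escape, so the backslash is kept
print(len(plain))    # 2: a backslash followed by a period
```

If both literals are the same two-character string, re.sub receives identical arguments for text2 and text3, which would explain the identical output.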

EDIT 1: Following the comment by @Wiktor Stribiżew, who pointed out (via his link) this example:

import re
print(re.sub(r'(.)(.)(.)(.)(.)(.)', 'a\6b', '123456'))
print(re.sub(r'(.)(.)(.)(.)(.)(.)', r'a\6b', '123456'))
# in my example the substitutions were not the same but the results were equal
# here r' does indeed change the result

which gives:

ab
a6b

that puzzles me even more.
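A possible way to see the difference, sketched with repr(): in a normal string literal \6 is an octal escape, so Python turns it into a single control character before the regex engine is involved, while the raw literal keeps the backslash and lets re.sub read \6 as a reference to group 6. The variable names below are illustrative.

```python
import re

plain = 'a\6b'   # \6 is an octal escape: one character, chr(6)
raw = r'a\6b'    # backslash and 6 are kept as two characters

print(repr(plain), len(plain))  # 'a\x06b' 3
print(repr(raw), len(raw))      # 'a\\6b' 4

pattern = r'(.)(.)(.)(.)(.)(.)'
with_plain = re.sub(pattern, plain, '123456')
with_raw = re.sub(pattern, raw, '123456')

print(repr(with_plain))  # 'a\x06b': the "ab" output hides an unprintable character
print(repr(with_raw))    # 'a6b': \6 was expanded to the sixth group
```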

Note: I read this Stack Overflow question about raw strings, which is very complete. Nevertheless, it does not cover substitutions.

asked Jun 10 '19 09:06 by JFerro



1 Answer

A simple way to work around all these string escaping issues is to use a function/lambda as the repl argument, instead of a string. For example:

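import re

# find_pattern, replacement and text are placeholders for your own values
output = re.sub(
    pattern=find_pattern,
    repl=lambda _: replacement,  # the callable's return value is inserted as-is
    string=text,
)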

The replacement string won't be parsed at all, just substituted in place of the match.
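As a small illustration of the difference, here is a sketch that reuses the six-group pattern from the question; the variable names are only illustrative. The same replacement text is passed once as a string and once through a lambda.

```python
import re

pattern = r'(.)(.)(.)(.)(.)(.)'
replacement = r'a\6b'  # backslash and 6 kept as two characters

# As a string, the replacement is treated as a template: \6 expands to group 6.
as_template = re.sub(pattern, replacement, '123456')

# As a function, the return value is inserted as-is, backslash included.
as_function = re.sub(pattern, lambda _: replacement, '123456')

print(as_template)  # a6b
print(as_function)  # a\6b
```

The lambda ignores the match object it receives (hence the _ parameter), so this only fits cases where no group references are wanted in the replacement.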

answered Sep 28 '22 12:09 by joerick