Confused about backslashes in regular expressions [duplicate]

Tags: python, regex

I am confused with the backslash in regular expressions. Within a regex a \ has a special meaning, e.g. \d means a decimal digit. If you add a backslash in front of the backslash this special meaning gets lost. In the regex-howto one can read:

Perhaps the most important metacharacter is the backslash, \. As in Python string literals, the backslash can be followed by various characters to signal various special sequences. It’s also used to escape all the metacharacters so you can still match them in patterns; for example, if you need to match a [ or \, you can precede them with a backslash to remove their special meaning: \[ or \\.

So print(re.search('\d', '\d')) gives None because \d matches any decimal digit but there is none in \d.

I now would expect print(re.search('\\d', '\d')) to match \d but the answer is still None.

Only print(re.search('\\\d', '\d')) gives as output <_sre.SRE_Match object; span=(0, 2), match='\\d'>.

Does someone have an explanation?

asked Nov 07 '15 11:11

tobmei05

2 Answers

The confusion is due to the fact that the backslash character \ is used as an escape at two different levels. First, the Python interpreter itself performs substitutions for \ before the re module ever sees your string. For instance, \n is converted to a newline character, \t is converted to a tab character, etc. To get an actual \ character, you can escape it as well, so \\ gives a single \ character. If the character following the \ isn't a recognized escape character, then the \ is treated like any other character and passed through, but I don't recommend depending on this. Instead, always escape your \ characters by doubling them, i.e. \\.
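As a quick illustration of that first level, here is a minimal sketch (the variable names are just for this example) showing how many characters Python actually stores for a few literals:

```python
# Each literal is written with escapes; the comment shows what Python stores.
newline = '\n'      # one character: a line feed
tab = '\t'          # one character: a tab
backslash = '\\'    # one character: a single backslash

print(len(newline), len(tab), len(backslash))   # 1 1 1
print(backslash == chr(92))                     # True (92 is the code point of \)
```

None of this involves the re module yet; it is purely how string literals are read.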

If you want to see how Python is expanding your string escapes, just print out the string. For example:

s = 'a\\b\tc'
print(s)

If s is part of an aggregate data type, e.g. a list or a tuple, and if you print that aggregate, Python will enclose the string in single quotes and will include the \ escapes (in a canonical form), so be aware of how your string is being printed. If you just type a quoted string into the interpreter, it will also display it enclosed in quotes with \ escapes.
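To make the difference between the two display forms concrete, here is a small sketch using the same string as above. print() on the string shows the stored characters, while printing a list (or calling repr()) shows the quoted, escaped form:

```python
s = 'a\\b\tc'

print(s)          # a\b, then a tab, then c
print(len(s))     # 5 (a, \, b, tab, c)
print([s])        # ['a\\b\tc']
print(repr(s))    # 'a\\b\tc'
```

So seeing a doubled backslash in the list output does not mean the string contains two backslashes.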

Once you know how your string is being encoded, you can then think about what the re module will do with it. For instance, if you want to escape \ in a string you pass to the re module, you will need to pass \\ to re, which means you will need to use \\\\ in your quoted Python string. The Python string will end up with \\ and the re module will treat this as a single literal \ character.
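A minimal sketch of the two levels working together (pattern and subject are names picked for this example): four backslashes in the source become two in the string, and the re module reads those two as one literal backslash.

```python
import re

pattern = '\\\\'    # Python stores two characters: \\
subject = 'a\\b'    # Python stores three characters: a\b

print(len(pattern))                    # 2
match = re.search(pattern, subject)
print(match.group() == chr(92))        # True, the match is a single backslash
print(match.span())                    # (1, 2)
```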

An alternative way to include \ characters in Python strings is to use raw strings, e.g. r'a\b' is equivalent to "a\\b".
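For example, a short sketch (variable names are only for illustration) confirming the equivalence and showing that a raw string halves the number of backslashes you have to type for a regex:

```python
import re

same_text = (r'a\b' == "a\\b")
same_pattern = (r'\\' == '\\\\')

print(same_text)       # True
print(len(r'a\b'))     # 3
print(same_pattern)    # True

# The raw pattern r'\\' is a regex for one literal backslash.
found = re.search(r'\\', r'a\b')
print(found.span())    # (1, 2)
```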

answered Oct 09 '22 06:10

Tom Karzes

An r character before the regular expression in a call to search() specifies that the regular expression is a raw string. This allows backslashes to be used in the regular expression as regular characters rather than in an escape sequence of characters. Let me explain ...

Before the re module's search method processes the strings that are passed to it, the Python interpreter takes an initial pass over the string. If there are backslashes present in a string, the Python interpreter must decide if each is part of a Python escape sequence (e.g. \n or \t) or not.

Note: at this point Python does not care whether or not '\' is a regular expression meta-character.

If the '\' is followed by a recognized Python escape character (t, n, etc.), then the backslash and the escape character are replaced with the actual Unicode or 8-bit character. For example, '\t' would be replaced with the ASCII character for tab. Otherwise the backslash is left in place and treated as an ordinary '\' character.

Consider the following.

>>> s = '\t'
>>> print("[" + s + "]")
[       ]           # an actual tab character after preprocessing
>>> s = '\d'
>>> print("[" + s + "]")
[\d]                # '\d' after preprocessing

Sometimes we want to include in a string a character sequence that includes '\' without it being interpreted by Python as an escape sequence. To do this we escape the '\' with another '\'. Now when Python sees '\\' it replaces the two backslashes with a single '\' character.

>>> s = '\\t'
>>> print("[" + s + "]")
[\t]                # '\t' after preprocessing
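If the printed output is hard to read, one way to check what Python stored is to look at the code points directly. This is only a sketch with made-up variable names:

```python
tab = '\t'         # recognized escape: becomes one tab character
escaped = '\\t'    # escaped backslash: becomes a backslash followed by t

tab_codes = [ord(ch) for ch in tab]
escaped_codes = [ord(ch) for ch in escaped]

print(tab_codes)        # [9]
print(escaped_codes)    # [92, 116]
```

Code point 9 is the tab, 92 is the backslash and 116 is the letter t.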

After the Python interpreter takes a pass over both strings, they are passed to the re module's search method. The search method parses the regular expression string to identify the regular expression's meta-characters.

Now '\' is also a special regular expression meta-character and is interpreted as one UNLESS it is escaped at the time that the re search() method is executed.

Consider the following call.

>>> match = re.search('a\\t', 'a\\t')        # match is None

Here, match is None. Why? Let's look at the strings after the Python interpreter makes its pass.

String 1: 'a\t'
String 2: 'a\t'

So why is match equal to None? When search() interprets String 1, since it is a regular expression, the backslash is interpreted as a meta-character, not an ordinary character. The backslash in String 2 however is not in a regular expression and has already been processed by the Python interpreter, so it is interpreted as an ordinary character.

So the search() method is looking for the letter 'a' followed by an actual tab character, while the string contains 'a' followed by a backslash and the letter 't'. These do not match.
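A small sketch to confirm this reading (miss and hit are names chosen for the example): the same pattern fails against the backslash-t text but succeeds against a string containing a real tab.

```python
import re

pattern = 'a\\t'    # stored as a\t, which the re module reads as 'a' + TAB

miss = re.search(pattern, 'a\\t')    # subject is a, backslash, t
hit = re.search(pattern, 'a\t')      # subject is a, TAB

print(miss)          # None
print(hit.span())    # (0, 2)
```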

To fix this we can tell the search() method to not interpret the '\' as a meta-character. We can do this by escaping it.

Consider the following call.

>>> match = re.search('a\\\\t', 'a\\t')          # match contains 'a\t'

Again, let's look at the strings after the Python interpreter has made its pass.

String 1: 'a\\t'
String 2: 'a\t'

Now when the search() method processes the regular expression, it sees that the second backslash is escaped by the first and should not be considered a meta-character. It therefore looks for 'a', a literal backslash, and 't', which matches String 2.

An alternate way to have search() consider '\' as a character is to place an r before the regular expression. This tells the Python interpreter to NOT preprocess the string.

Consider this.

>>> match = re.search(r'a\\t', 'a\\t')           # match contains 'a\t'

Here the Python interpreter does not modify the first string but does process the second string. The strings passed to search() are:

String 1: 'a\\t'
String 2: 'a\t'

As in the previous example, search() interprets the '\\' as the single literal character '\' and not as a meta-character, and thus matches String 2.
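Applied to the three calls in the question, the same reasoning can be sketched as follows. To keep the example free of the invalid-escape warning that newer Python versions emit for a literal like '\d', the subject is written as the raw string r'\d' and the patterns use their equivalent spellings: '\d' and '\\d' both store backslash-d, and '\\\d' stores the same three characters as r'\\d'.

```python
import re

subject = r'\d'    # two characters: backslash, d

# Pattern stored as \d: the digit class. The subject has no digit.
no_match = re.search('\\d', subject)

# Pattern stored as \\d: a literal backslash followed by d.
literal_match = re.search(r'\\d', subject)

# The digit class does match when the subject contains a digit.
digit_match = re.search(r'\d', 'room 7')

print(no_match)                # None
print(literal_match.span())    # (0, 2)
print(digit_match.group())     # 7
```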

answered Oct 09 '22 06:10

eric.mcgregor
