Raw string and regular expression in Python

Tags:

python

regex

escaping

rawstring

backslash

I'm confused about raw strings in the following code:

import re

text2 = 'Today is 11/27/2012. PyCon starts 3/13/2013.'
text2_re = re.sub(r'(\d+)/(\d+)/(\d+)', r'\3-\1-\2', text2)
print (text2_re) #output: Today is 2012-11-27. PyCon starts 2013-3-13.

print (r'(\d+)/(\d+)/(\d+)') #output: (\d+)/(\d+)/(\d+)

As I understand raw strings: without r, the backslash \ is treated as an escape character; with r, the backslash \ is treated literally as itself.

However, what I cannot understand in the above code is this:

In the regular expression passed to re.sub, even though there is an r prefix, the \d inside is treated as one digit [0-9] instead of one backslash \ plus one letter d.
In the second print call, all characters are printed literally, as in a raw string.

What is the difference?

Additional edit:

I made the following four variations, with or without r:

import re

text2 = 'Today is 11/27/2012. PyCon starts 3/13/2013.'
text2_re = re.sub(r'(\d+)/(\d+)/(\d+)', r'\3-\1-\2', text2)
text2_re1 = re.sub('(\d+)/(\d+)/(\d+)', r'\3-\1-\2', text2)
text2_re2 = re.sub(r'(\d+)/(\d+)/(\d+)', '\3-\1-\2', text2)
text2_re3 = re.sub('(\d+)/(\d+)/(\d+)', '\3-\1-\2', text2)

print (text2_re)
print (text2_re1)
print (text2_re2)
print (text2_re3)

And I get the following output (screenshot: https://i.stack.imgur.com/pAG1f.png):

Could you explain these four situations specifically?

318

asked May 11 '15 09:05

fluency03

1 Answer

You're getting confused by the difference between a string and a string literal.

A string literal is what you write between " or ' in source code; the Python interpreter parses it and stores the resulting string in memory. If you mark the literal as raw (using r'), the interpreter does not process backslash escape sequences before storing it, but once parsed, raw and non-raw literals are stored in exactly the same way.

This means that in memory there is no such thing as a raw string. Both the following strings are stored identically in memory with no concept of whether they were raw or not.

r'a regex digit: \d'  # a regex digit: \d
'a regex digit: \\d'  # a regex digit: \d

Both these strings contain \d, and nothing records that one of them came from a raw string literal. So when you pass either string to the re module, it sees \d and interprets it as a digit, because the re module cannot know how the string was written in the source code.
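As a quick check of the point above, here is a minimal sketch comparing the two literals. The variable names are only illustrative, and nothing here depends on the re module.

```python
raw = r'a regex digit: \d'
escaped = 'a regex digit: \\d'

# Both literals produce strings with identical contents.
print(raw == escaped)          # True

# Either way, the stored text ends with two characters: a backslash and a d.
print(len(r'\d'), len('\\d'))  # 2 2
print(list(r'\d'))             # ['\\', 'd']
```

The doubled backslash in the last output is just how repr displays a single backslash; the string itself still holds only one.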

In your specific example, to match a literal backslash followed by a literal d, you would write \\d in the raw pattern, like so:

import re

text2 = 'Today is 11/27/2012. PyCon starts 3/13/2013.'
text2_re = re.sub(r'(\\d+)/(\\d+)/(\\d+)', r'\3-\1-\2', text2)
print (text2_re) #output: Today is 11/27/2012. PyCon starts 3/13/2013.

Alternatively, without using raw strings:

import re

text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'
# '\\d' in a normal literal is stored as \d, the regex digit class
text_re = re.sub('(\\d+)/(\\d+)/(\\d+)', '\\3-\\1-\\2', text)
print (text_re) #output: Today is 2012-11-27. PyCon starts 2013-3-13.

text2 = 'Today is 11/27/2012. PyCon starts 3/13/2013.'
# '\\\\d' is stored as \\d, which the regex reads as a literal backslash plus d
text2_re = re.sub('(\\\\d+)/(\\\\d+)/(\\\\d+)', '\\3-\\1-\\2', text2)
print (text2_re) #output: Today is 11/27/2012. PyCon starts 3/13/2013.
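To see the difference between the two patterns on input that really does contain a backslash, here is a small sketch. The sample string and variable names are made up for illustration.

```python
import re

# A made-up sample that contains digits and a real backslash followed by d.
sample = r'digits: 42, literal: \d'

# \d in the pattern is the regex digit class.
digits_replaced = re.sub(r'\d', '#', sample)
print(digits_replaced)   # digits: ##, literal: \d

# \\d in the pattern is an escaped backslash followed by the letter d.
literal_replaced = re.sub(r'\\d', '#', sample)
print(literal_replaced)  # digits: 42, literal: #
```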

I hope that helps somewhat.

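Edit: I didn't want to complicate things, but because \d is not a valid escape sequence, Python leaves it unchanged, so '\d' == r'\d' is true. Since \\ is a valid escape sequence, it gets changed to \, so you get the behaviour '\d' == '\\d' == r'\d'. Newer Python versions do warn about unrecognized escapes such as '\d' (a DeprecationWarning since 3.6, a SyntaxWarning since 3.12), so the raw form is the safer way to write it. Strings get confusing sometimes.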
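The equality above can be checked with a short sketch. It evaluates the non-raw literal through eval so that the warning some Python versions emit for the unrecognized escape can be silenced locally; this is only a demonstration device, and it assumes a Python version where '\d' is still accepted in a normal literal.

```python
import warnings

# Evaluate the source text '\d' (a non-raw literal) with warnings silenced,
# since the interpreter may warn about the unrecognized escape sequence.
with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    non_raw = eval(r"'\d'")

print(non_raw == r'\d')  # True
print(non_raw == '\\d')  # True
print(len(non_raw))      # 2
```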

Edit2: To answer your edit, let's look at each line specifically:

text2_re = re.sub(r'(\d+)/(\d+)/(\d+)', r'\3-\1-\2', text2)

re.sub receives the two strings (\d+)/(\d+)/(\d+) and \3-\1-\2. Hopefully this behaves as you expect now.

text2_re1 = re.sub('(\d+)/(\d+)/(\d+)', r'\3-\1-\2', text2)

Again, because \d is not a valid string escape, it doesn't get changed (see my first edit), so re.sub receives the two strings (\d+)/(\d+)/(\d+) and \3-\1-\2. Since \d doesn't get changed by the Python interpreter, r'(\d+)/(\d+)/(\d+)' == '(\d+)/(\d+)/(\d+)'. If you understand my first edit then hopefully you should understand why these two cases behave the same.

text2_re2 = re.sub(r'(\d+)/(\d+)/(\d+)', '\3-\1-\2', text2)

This case is a bit different because \1, \2 and \3 are all valid escape sequences: each is replaced with the Unicode character whose octal value is given by the digits. That's quite complex, but it basically boils down to:

\1  # stands for the ascii start-of-heading character
\2  # stands for the ascii start-of-text character
\3  # stands for the ascii end-of-text character

This means that re.sub receives the first string as it did in the first two examples ((\d+)/(\d+)/(\d+)), but the second string is actually <end-of-text>-<start-of-heading>-<start-of-text>. So re.sub replaces each match with exactly that second string, and since none of the three (\1, \2 or \3) is a printable character, the terminal typically shows a stock placeholder character instead.

text2_re3 = re.sub('(\d+)/(\d+)/(\d+)', '\3-\1-\2', text2)

This behaves like the third example because r'(\d+)/(\d+)/(\d+)' == '(\d+)/(\d+)/(\d+)', as explained in the second example.
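The raw and non-raw replacement strings can be compared directly with repr, which makes the unprintable characters visible. This is a minimal sketch using the question's data; the names pattern, good and bad are illustrative, and only the raw form of the pattern is used since the non-raw one is equal to it.

```python
import re

text2 = 'Today is 11/27/2012. PyCon starts 3/13/2013.'
pattern = r'(\d+)/(\d+)/(\d+)'

# Raw replacement: re.sub receives backslash-digit group references.
good = re.sub(pattern, r'\3-\1-\2', text2)

# Non-raw replacement: Python has already converted \3, \1 and \2 into
# control characters before re.sub ever sees the string.
bad = re.sub(pattern, '\3-\1-\2', text2)

print(repr(r'\3-\1-\2'))  # '\\3-\\1-\\2'
print(repr('\3-\1-\2'))   # '\x03-\x01-\x02'
print(good)               # Today is 2012-11-27. PyCon starts 2013-3-13.
print(repr(bad))          # 'Today is \x03-\x01-\x02. PyCon starts \x03-\x01-\x02.'
```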

115

answered Oct 06 '22 00:10

Sean1708
