Here is an interesting oddity about Python's repr:
The tab character \x09 is represented as \t. However, this convention does not apply to the null character. Why is \x00 represented as \x00, rather than \0?
Sample code:
# Some facts to make sure we are on the same page
>>> '\x31' == '1'
True
>>> '\x09' == '\t'
True
>>> '\x00' == '\0'
True
>>> x = '\x31'
>>> y = '\x09'
>>> z = '\x00'
>>> x
'1' # As Expected
>>> y
'\t' # Okay
>>> z
'\x00' # Inconsistent - why is this not \0
The short answer: because that's not a specific escape that is used. String representations only use the single-character escapes \\, \n, \r and \t (plus \' when both " and ' characters are present) because there are explicit tests for those.
The rest is either considered printable and included as-is, or included using a longer escape sequence (depending on the Python version and string type, \xhh, \uhhhh and \Uhhhhhhhh, always using the shortest of the 3 options that'll fit the value).
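As a quick check of the rules above, here is a small sketch that compares a few strings against the repr() output I would expect. It assumes CPython 3; the `samples` list is only an illustration and is not taken from the interpreter source.

```python
# Each pair is (string, expected repr() output), assuming CPython 3.
samples = [
    ("\t", r"'\t'"),                  # single-character escape
    ("\n", r"'\n'"),
    ("\r", r"'\r'"),
    ("\\", r"'\\'"),
    ("\x00", r"'\x00'"),              # no \0 shorthand in the output
    ("\x07", r"'\x07'"),              # bell: repr() does not emit \a either
    ("\x7f", r"'\x7f'"),
    ("\x80", r"'\x80'"),              # non-printable, fits in \xhh
    ("\u200b", r"'\u200b'"),          # non-printable, needs \uhhhh
    ("\U000e0001", r"'\U000e0001'"),  # non-printable, needs \Uhhhhhhhh
    ("\xe9", "'é'"),                  # printable, included as-is
]

for text, expected in samples:
    assert repr(text) == expected, (text, repr(text), expected)
```

Note that \a, \b, \f and \v are all accepted in string literals too, but none of them show up in repr() output; only the escapes with explicit tests do.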
Moreover, when generating the repr() output for a string consisting of a null byte followed by a digit from '0' through to '7' (e.g. bytes([0x00, 0x31]) or bytes([0x00, 0x32])), you can't just use \0 in the output without then also having to escape the following digit. '\01' is a single octal escape sequence, and not the same value as '\x001', which is two bytes. While forcing the output to always use three octal digits (e.g. '\0001') could be a work-around, it is just simpler to stick to a standardised, simpler escape sequence format. Scanning ahead to see if the next character is an octal digit and switching output styles would just produce confusing output (imagine the question on SO: What is the difference between '\x001' and '\0Ol'?)
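The ambiguity is easy to see in an interpreter. The following is a minimal sketch for Python 3; the variable names are only for illustration.

```python
import ast

one_char = "\01"      # octal escape: a single character, U+0001
two_chars = "\x001"   # a null character followed by the digit '1'
padded = "\0001"      # three-digit octal escape \000, then the digit '1'

assert one_char == "\x01" and len(one_char) == 1
assert len(two_chars) == 2
assert padded == two_chars

# repr() sidesteps the problem by always emitting \x00,
# so the output round-trips without any look-ahead.
assert repr(two_chars) == r"'\x001'"
assert repr(bytes([0x00, 0x31])) == r"b'\x001'"
assert ast.literal_eval(repr(two_chars)) == two_chars
```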
The output is always consistent. Apart from the single quote (which can appear either as ' or as \', depending on the presence of " characters), Python will always use the same escape sequence style for a given codepoint.
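The single-quote exception looks like this in practice. This is a small sketch for Python 3; the `quote_cases` mapping is just an illustrative name.

```python
# Maps each string to the repr() output expected from CPython 3.
quote_cases = {
    "plain": "'plain'",                    # default: single quotes
    "it's": '"it\'s"',                     # only ' present: switch to double quotes
    'say "hi"': "'say \"hi\"'",            # only " present: single quotes, no escape
    "both ' and \"": "'both \\' and \"'",  # both present: escape the single quote
}

for text, expected in quote_cases.items():
    assert repr(text) == expected, (text, repr(text))
```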
If you want to study the code that produces the output, you can find the Python 3 str.__repr__ implementation in the unicode_repr() function in Objects/unicodeobject.c, which uses
    /* Escape quotes and backslashes */
    if ((ch == quote) || (ch == '\\')) {
        PyUnicode_WRITE(okind, odata, o++, '\\');
        PyUnicode_WRITE(okind, odata, o++, ch);
        continue;
    }

    /* Map special whitespace to '\t', '\n', '\r' */
    if (ch == '\t') {
        PyUnicode_WRITE(okind, odata, o++, '\\');
        PyUnicode_WRITE(okind, odata, o++, 't');
    }
    else if (ch == '\n') {
        PyUnicode_WRITE(okind, odata, o++, '\\');
        PyUnicode_WRITE(okind, odata, o++, 'n');
    }
    else if (ch == '\r') {
        PyUnicode_WRITE(okind, odata, o++, '\\');
        PyUnicode_WRITE(okind, odata, o++, 'r');
    }
for single-character escapes, followed by additional checks for longer escapes below. For Python 2, a similar but shorter PyString_Repr() function does much the same thing.
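To make the order of those checks easier to follow, here is a rough pure-Python model of the same decision chain. It is a sketch only: `sketch_repr` is a hypothetical name, it assumes Python 3 str objects, and it relies on str.isprintable() as a stand-in for the printability test used by the C code.

```python
def sketch_repr(s: str) -> str:
    """Approximate str.__repr__ for Python 3 (illustrative model, not CPython code)."""
    # Prefer single quotes; use double quotes only when the string
    # contains a single quote and no double quote.
    quote = '"' if ("'" in s and '"' not in s) else "'"
    out = [quote]
    for ch in s:
        cp = ord(ch)
        if ch == quote or ch == "\\":
            out.append("\\" + ch)          # escape quotes and backslashes
        elif ch == "\t":
            out.append("\\t")
        elif ch == "\n":
            out.append("\\n")
        elif ch == "\r":
            out.append("\\r")
        elif cp < 0x20 or cp == 0x7F:
            out.append(f"\\x{cp:02x}")     # other ASCII control characters
        elif cp < 0x7F or ch.isprintable():
            out.append(ch)                 # printable: included as-is
        elif cp <= 0xFF:
            out.append(f"\\x{cp:02x}")     # shortest escape that fits
        elif cp <= 0xFFFF:
            out.append(f"\\u{cp:04x}")
        else:
            out.append(f"\\U{cp:08x}")
    out.append(quote)
    return "".join(out)


checks = ["a\tb", "\x00", "\x001", "it's", 'say "hi"', "both ' \"",
          "\\", "\xe9", "\x80", "\u200b", "\U000e0001"]
for text in checks:
    assert sketch_repr(text) == repr(text), (text, sketch_repr(text))
```

There is no branch for the null character anywhere in that chain, so it falls through to the generic control-character case and comes out as \x00.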