
Why is \x00 not converted to \0 by repr

Tags: python, repr

Here is an interesting oddity about Python's repr:

The tab character \x09 is represented as \t. However, this convention does not apply to the null character.

Why is \x00 represented as \x00, rather than \0?

Sample code:

# Some facts to make sure we are on the same page
>>> '\x31' == '1'
True
>>> '\x09' == '\t'
True
>>> '\x00' == '\0'
True

>>> x = '\x31'
>>> y = '\x09'
>>> z = '\x00'
>>> x
'1' # As Expected
>>> y
'\t' # Okay
>>> z
'\x00' # Inconsistent - why is this not \0
Andrei Cioara asked Jan 02 '23 17:01


1 Answer

The short answer: because \0 is not one of the escapes that repr() uses. String representations only use the single-character escapes \\, \n, \r and \t (plus \' when both " and ' characters are present), because those are the only ones there are explicit tests for.
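
To see this for yourself, here is a minimal check, assuming Python 3 (the variable names are just for illustration). It compares the control characters that get a short escape with the ones that have a shorthand in string literals but not in repr() output:

```python
# Characters that repr() writes with a single-character escape.
short_escapes = {ch: repr(ch) for ch in ["\\", "\n", "\r", "\t"]}

# Characters that can be *written* with a short escape in a literal
# (\0, \a, \b, \f, \v) but that repr() emits in the \xhh form.
long_escapes = {ch: repr(ch) for ch in ["\0", "\a", "\b", "\f", "\v"]}

for ch, shown in {**short_escapes, **long_escapes}.items():
    print(f"U+{ord(ch):04X} -> {shown}")
```

So \0 is not being singled out; \a, \b, \f and \v get the same treatment.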

The rest is either considered printable and included as-is, or included using a longer escape sequence (depending on the Python version and string type, \xhh, \uhhhh or \Uhhhhhhhh, always using the shortest of the three options that fits the value).
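
A small sketch of the three longer forms, assuming Python 3 str objects (the sample codepoints are my own picks: each is non-printable, so repr() has to escape it):

```python
# One non-printable sample per escape width.
samples = ["\x00", "\x7f", "\u2028", "\U0010ffff"]
shown = [repr(s) for s in samples]
for s, r in zip(samples, shown):
    print(f"U+{ord(s[0]):06X} -> {r}")

# A printable non-ASCII character is included as-is by repr();
# ascii() escapes it instead, again with the shortest form that fits.
printable = repr("\xe9")
escaped = ascii("\xe9")
print(printable, escaped)
```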

Moreover, when generating the repr() output for a string consisting of a null byte followed by a digit from '0' through to '7' (so bytes([0x00, 0x31]), or bytes([0x00, 0x32]), etc.), you can't just use \0 in the output without then also having to escape the following digit. '\01' is a single octal escape sequence, and not the same value as '\x001', which is two bytes. While forcing the output to always use three octal digits (e.g. '\0001') could be a work-around, it is simpler to stick to a standardised escape sequence format. Scanning ahead to see if the next character is an octal digit and switching output styles would just produce confusing output (imagine the question on SO: What is the difference between '\x001' and '\0Ol'?).
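
The ambiguity is easy to reproduce. The sketch below assumes Python 3; naive_repr() is a hypothetical helper written only for this illustration (it handles nothing but the null character), not something Python provides:

```python
import ast

one_char = "\01"      # a single octal escape: U+0001
two_chars = "\x001"   # a null character followed by the digit '1'
print(len(one_char), len(two_chars))

# Three octal digits followed by '1' do give the two-character string.
padded = "\0001"
print(padded == two_chars)


def naive_repr(s):
    """Hypothetical repr that writes the null character as \\0."""
    return "'" + s.replace("\x00", "\\0") + "'"


# The naive output for two_chars is '\01', which reads back as one character.
naive = naive_repr(two_chars)
round_tripped = ast.literal_eval(naive)
print(naive, round_tripped == two_chars)

# The real repr() output survives the round trip.
print(ast.literal_eval(repr(two_chars)) == two_chars)
```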

The output is always consistent. Apart from the single quote (which can appear either as ' or \', depending on the presence of " characters), Python will always use the same escape sequence style for a given codepoint.
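
A quick illustration of the quote exception and of that consistency, assuming Python 3 (the sample strings are arbitrary):

```python
only_single = repr("it's")      # switches to double quotes, no escape needed
only_double = repr('say "hi"')  # stays with single quotes
both = repr("'\"")              # single quotes, with the ' escaped as \'
print(only_single, only_double, both)

# The null character is written the same way whatever follows it.
followed_by_letter = repr("\x00a")
followed_by_digit = repr("\x001")
print(followed_by_letter, followed_by_digit)
```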

If you want to study the code that produces the output, you can find the Python 3 str.__repr__ implementation in the Objects/unicodeobject.c unicode_repr() function, which uses

/* Escape quotes and backslashes */
if ((ch == quote) || (ch == '\\')) {
    PyUnicode_WRITE(okind, odata, o++, '\\');
    PyUnicode_WRITE(okind, odata, o++, ch);
    continue;
}


/* Map special whitespace to '\t', \n', '\r' */
if (ch == '\t') {
    PyUnicode_WRITE(okind, odata, o++, '\\');
    PyUnicode_WRITE(okind, odata, o++, 't');
}
else if (ch == '\n') {
    PyUnicode_WRITE(okind, odata, o++, '\\');
    PyUnicode_WRITE(okind, odata, o++, 'n');
}
else if (ch == '\r') {
    PyUnicode_WRITE(okind, odata, o++, '\\');
    PyUnicode_WRITE(okind, odata, o++, 'r');
}

for single-character escapes, followed by additional checks for longer escapes below. For Python 2, a similar but shorter PyString_Repr() function does much the same thing.
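
If C is not your thing, the same decision order can be approximated in Python. This is a simplified sketch under my own assumptions, not the CPython implementation: sketch_repr() is a hypothetical name, it only handles str, and it leans on str.isprintable() for the "considered printable" test:

```python
def sketch_repr(s):
    """Rough Python rendering of the escape choices made for str repr()."""
    # Prefer single quotes; use double quotes only if that avoids escaping.
    quote = '"' if "'" in s and '"' not in s else "'"
    out = [quote]
    for ch in s:
        cp = ord(ch)
        if ch == quote or ch == "\\":
            out.append("\\" + ch)         # escape quotes and backslashes
        elif ch == "\t":
            out.append("\\t")
        elif ch == "\n":
            out.append("\\n")
        elif ch == "\r":
            out.append("\\r")
        elif cp < 0x20 or cp == 0x7F:
            out.append(f"\\x{cp:02x}")    # other ASCII control characters
        elif cp < 0x7F or ch.isprintable():
            out.append(ch)                # printable: copied as-is
        elif cp <= 0xFF:
            out.append(f"\\x{cp:02x}")
        elif cp <= 0xFFFF:
            out.append(f"\\u{cp:04x}")
        else:
            out.append(f"\\U{cp:08x}")
    out.append(quote)
    return "".join(out)


for sample in ["1", "\t", "\x00", "\x001", "it's", "\u2028"]:
    print(sketch_repr(sample), sketch_repr(sample) == repr(sample))
```

Note that there is no branch anywhere for the null character; it simply falls into the generic \xhh case along with the other control characters.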

Martijn Pieters answered Jan 13 '23 22:01