
Default encoding of exception messages

The following code examines the behaviour of the built-in float() when fed a non-ASCII symbol:

import sys

try:
  float(u'\xbd')
except ValueError as e:
  print sys.getdefaultencoding() # in my system, this is 'ascii'
  print e[0].decode('latin-1') # u'invalid literal for float(): ' followed by the 1/2 (one half) character
  print unicode(e[0]) # raises "UnicodeDecodeError: 'ascii' codec can't decode byte 0xbd in position 29: ordinal not in range(128)"

My question: why is the error message e[0] encoded in Latin-1? The default encoding is ASCII, and this seems to be what unicode() expects.

Platform is Ubuntu 9.04, Python 2.6.2

pablobm asked Sep 02 '09 17:09



3 Answers

e[0] isn't encoded with latin-1; it just so happens that the byte \xbd, when decoded as latin-1, is the character U+00BD.

The conversion occurs in Objects/floatobject.c.

First, the unicode string must be converted to a byte string. This is performed using PyUnicode_EncodeDecimal():

if (PyUnicode_EncodeDecimal(PyUnicode_AS_UNICODE(v),
                            PyUnicode_GET_SIZE(v),
                            s_buffer,
                            NULL))
        return NULL;

which is implemented in unicodeobject.c. It doesn't perform any sort of character set conversion; it just writes bytes with values equal to the Unicode ordinals of the characters in the string. In this case, U+00BD -> 0xBD.

The statement formatting the error is:

PyOS_snprintf(buffer, sizeof(buffer),
              "invalid literal for float(): %.200s", s);

where s contains the byte string created earlier. PyOS_snprintf() writes a byte string, and s is a byte string, so it just includes it directly.
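The effect of that formatting step can be sketched in Python. This is only an illustration, written for Python 3 (where bytes and text are separate types) rather than the Python 2.6 from the question; the variable names are made up for the example, and plain concatenation stands in for PyOS_snprintf().

```python
# Byte produced for U+00BD by the decimal "encoding" step (value taken from the answer).
s = b"\xbd"

# Stand-in for PyOS_snprintf(... "%.200s", s): copy at most 200 bytes of s verbatim.
message = b"invalid literal for float(): " + s[:200]

# Decoding as Latin-1 always succeeds, because every byte value maps to a code point.
as_latin1 = message.decode("latin-1")

# Decoding as ASCII fails at the first byte above 0x7F.
try:
    message.decode("ascii")
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = exc

print(repr(as_latin1))
print(ascii_error.start)  # index of the offending byte in the message
```

The prefix "invalid literal for float(): " is 29 bytes long, which is why the UnicodeDecodeError in the question reports position 29.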

John Millikin answered Oct 20 '22 04:10


Very good question!

I took the liberty of digging into Python's source code, which is a mere command away on properly set up Linux distributions (apt-get source python2.5).

Damn, John Millikin beat me to it. He's right: PyUnicode_EncodeDecimal is the answer. It does this here:

/* (Loop ch in the unicode string) */
    if (Py_UNICODE_ISSPACE(ch)) {
        *output++ = ' ';
        ++p;
        continue;
    }
    decimal = Py_UNICODE_TODECIMAL(ch);
    if (decimal >= 0) {
        *output++ = '0' + decimal;
        ++p;
        continue;
    }
    if (0 < ch && ch < 256) {
        *output++ = (char)ch;
        ++p;
        continue;
    }
    /* All other characters are considered unencodable */
    collstart = p;
    collend = p+1;
    while (collend < end) {
        if ((0 < *collend && *collend < 256) ||
            !Py_UNICODE_ISSPACE(*collend) ||
            Py_UNICODE_TODECIMAL(*collend))
            break;
    }

See, it leaves all Unicode code points below 256 in place. Those are exactly the Latin-1 characters, because the first 256 Unicode code points were chosen to match Latin-1.
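As a rough Python model of the loop quoted above (a sketch for Python 3, not the CPython implementation): encode_decimal_sketch is a hypothetical name, str.isspace() and unicodedata.decimal() are used as approximations of Py_UNICODE_ISSPACE and Py_UNICODE_TODECIMAL, and the error-handler logic around collstart/collend is reduced to raising immediately.

```python
import unicodedata


def encode_decimal_sketch(text):
    """Approximate the per-character rules of the quoted C loop."""
    out = bytearray()
    for i, ch in enumerate(text):
        # Any Unicode whitespace becomes a plain ASCII space.
        if ch.isspace():
            out.append(ord(" "))
            continue
        # Any Unicode decimal digit becomes the matching ASCII digit.
        decimal = unicodedata.decimal(ch, -1)
        if decimal >= 0:
            out.append(ord("0") + decimal)
            continue
        # Other code points in 1..255 are copied as a single byte of the same value.
        if 0 < ord(ch) < 256:
            out.append(ord(ch))
            continue
        # Everything else is unencodable (simplified: no error handlers).
        raise UnicodeEncodeError(
            "decimal", text, i, i + 1, "invalid decimal Unicode string"
        )
    return bytes(out)


print(encode_decimal_sketch("\u00bd"))        # the one half character
print(encode_decimal_sketch("\u0663.\u0661"))  # Arabic-Indic digits 3 and 1
```

Under these assumptions the one half character comes out as the single byte 0xBD, while Arabic-Indic digits are rewritten to ASCII digits, which matches the behaviour the answers describe.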


Addendum

With this in mind, you can verify it by trying a character outside Latin-1; it throws a different exception:

>>> float(u"ħ")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'decimal' codec can't encode character u'\u0127' in position 0: invalid decimal Unicode string

u0b34a0f6ae answered Oct 20 '22 05:10


The ASCII encoding only includes the bytes with values <= 127. The range of characters represented by these bytes is identical in most encodings; in other words, "A" is chr(65) in ASCII, in latin-1, in UTF-8, and so on.

The one half symbol, however, is not part of the ASCII character set, so when Python tries to encode this symbol into ASCII, it can do nothing but fail.
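A quick way to see both points, sketched for Python 3 (the question itself uses Python 2.6, so this is an illustration rather than the asker's setup; the variable names are made up):

```python
half = "\u00bd"  # the one half character

# "A" encodes to the same single byte in all three encodings.
shared = {enc: "A".encode(enc) for enc in ("ascii", "latin-1", "utf-8")}

# The one half character differs between Latin-1 and UTF-8 ...
half_latin1 = half.encode("latin-1")
half_utf8 = half.encode("utf-8")

# ... and has no ASCII representation at all.
try:
    half.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

print(shared, half_latin1, half_utf8, ascii_ok)
```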

Update: Here's what happens (I assume we're talking CPython):

float(u'\xbd') leads to PyFloat_FromString in floatobject.c being called. This function, given a unicode object, in turn calls PyUnicode_EncodeDecimal in unicodeobject.c. From skimming over the code, I gather that this function turns the unicode object into a string by replacing every character whose Unicode code point is below 256 with the byte of that value, i.e. the one half character, having the code point 189, is turned into chr(189).

Then, PyFloat_FromString does its work as usual. At this moment, it's working with a regular string, which happens to contain a byte outside the ASCII range. It doesn't care about this; it just finds a byte that's not a digit, a period or the like, so it raises the ValueError.

The argument to this exception is a string:

"invalid literal for float(): " + evil_string

That's fine; an exception message is, after all, a string. It's only when you try to decode this string, using the default encoding ASCII, that this turns into a problem.

balpha answered Oct 20 '22 05:10