The following code examines the behaviour of the built-in float() function when fed a non-ASCII symbol:
import sys
try:
    float(u'\xbd')
except ValueError as e:
    print sys.getdefaultencoding() # in my system, this is 'ascii'
    print e[0].decode('latin-1') # u'invalid literal for float(): ' followed by the 1/2 (one half) character
    print unicode(e[0]) # raises "UnicodeDecodeError: 'ascii' codec can't decode byte 0xbd in position 29: ordinal not in range(128)"
My question: why is the error message e[0] encoded in Latin-1? The default encoding is ASCII, and this seems to be what unicode() expects.
Platform is Ubuntu 9.04, Python 2.6.2
UTF-8 is one of the most commonly used encodings, and Python often defaults to using it. UTF stands for “Unicode Transformation Format”, and the '8' means that 8-bit values are used in the encoding. (There are also UTF-16 and UTF-32 encodings, but they are less frequently used than UTF-8.)
In Python 3, strings are sequences of Unicode code points by default. We only need to convert to and from bytes when writing or reading data. The default for str.encode() and bytes.decode() is UTF-8, but other encodings can also be used.
e[0] isn't encoded with latin-1; it just so happens that the byte \xbd, when decoded as latin-1, is the character U+00BD.
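That coincidence is easy to check. The snippet below is a small Python 3 sketch (the question itself uses Python 2.6, where the equivalent call is str.decode('latin-1')); the variable names are only illustrative. It shows that decoding a byte as Latin-1 yields the code point with the same number, so 0xBD comes back as U+00BD.

```python
# Python 3 sketch; the names `raw` and `decoded` are illustrative only.
raw = b"\xbd"                      # the byte that ended up in the error message
decoded = raw.decode("latin-1")    # Latin-1 maps byte N to code point N

print(hex(ord(decoded)))           # 0xbd
print(decoded == "\u00bd")         # True

# The mapping is the identity for every possible byte value.
assert all(bytes([n]).decode("latin-1") == chr(n) for n in range(256))
```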
The conversion occurs in Objects/floatobject.c.
First, the unicode string must be converted to a byte string. This is performed using PyUnicode_EncodeDecimal():
if (PyUnicode_EncodeDecimal(PyUnicode_AS_UNICODE(v),
                            PyUnicode_GET_SIZE(v),
                            s_buffer,
                            NULL))
    return NULL;
which is implemented in unicodeobject.c. It doesn't perform any sort of character set conversion; for code points below 256 it just writes bytes with values equal to the Unicode ordinals of the string. In this case, U+00BD -> 0xBD.
The statement formatting the error is:
PyOS_snprintf(buffer, sizeof(buffer),
              "invalid literal for float(): %.200s", s);
where s contains the byte string created earlier. PyOS_snprintf() writes a byte string, and s is a byte string, so it just includes it directly.
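To see why the question's UnicodeDecodeError reports position 29, the byte-level assembly can be imitated in Python 3. This is only a sketch of the effect, not CPython's code: the names prefix, message and failure_position are made up for the illustration, and plain concatenation stands in for the "%.200s" format.

```python
# Python 3 sketch of the byte-level message assembly (not CPython's actual code).
prefix = b"invalid literal for float(): "
s = b"\xbd"                  # the byte produced for u'\xbd'
message = prefix + s         # stands in for the "%.200s" formatting

failure_position = None
try:
    message.decode("ascii")  # what unicode(e[0]) attempts with an ASCII default
except UnicodeDecodeError as err:
    failure_position = err.start

print(failure_position)                     # 29
print(ascii(message.decode("latin-1")))     # decodes cleanly as Latin-1
```

The prefix is 29 bytes long, so the stray 0xBD byte sits at index 29, which matches the position in the traceback from the question.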
Very good question!
I took the liberty of digging into Python's source code, which is a mere command away on properly set up Linux distributions (apt-get source python2.5).
Damn, John Millikin beat me to it. That's right, PyUnicode_EncodeDecimal is the answer. It does this here:
/* (Loop ch in the unicode string) */
    if (Py_UNICODE_ISSPACE(ch)) {
        *output++ = ' ';
        ++p;
        continue;
    }
    decimal = Py_UNICODE_TODECIMAL(ch);
    if (decimal >= 0) {
        *output++ = '0' + decimal;
        ++p;
        continue;
    }
    if (0 < ch && ch < 256) {
        *output++ = (char)ch;
        ++p;
        continue;
    }
    /* All other characters are considered unencodable */
    collstart = p;
    collend = p+1;
    while (collend < end) {
        if ((0 < *collend && *collend < 256) ||
            !Py_UNICODE_ISSPACE(*collend) ||
            Py_UNICODE_TODECIMAL(*collend))
            break;
    }
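See, it leaves all Unicode code points between 1 and 255 in place as single bytes (apart from whitespace and digits, which are handled first). Those are exactly the Latin-1 characters, because the first 256 Unicode code points were made identical to Latin-1 for backward compatibility.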
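The three branches of that loop can be imitated in Python 3 for experimentation. This is a rough sketch under assumptions, not the real implementation: encode_decimal_sketch is a hypothetical name, str.isspace() stands in for Py_UNICODE_ISSPACE, and unicodedata.decimal() stands in for Py_UNICODE_TODECIMAL.

```python
import unicodedata


def encode_decimal_sketch(text):
    """Approximate the excerpted loop; hypothetical helper, illustration only."""
    out = bytearray()
    for pos, ch in enumerate(text):
        if ch.isspace():                       # any Unicode space becomes ' '
            out.append(ord(" "))
            continue
        digit = unicodedata.decimal(ch, -1)    # -1 if ch is not a decimal digit
        if digit >= 0:                         # any Unicode digit becomes '0'..'9'
            out.append(ord("0") + digit)
            continue
        if 0 < ord(ch) < 256:                  # copied through as a raw byte
            out.append(ord(ch))
            continue
        # Everything else is treated as unencodable.
        raise UnicodeEncodeError("decimal", text, pos, pos + 1,
                                 "invalid decimal Unicode string")
    return bytes(out)


print(encode_decimal_sketch("\u00bd"))      # b'\xbd'
print(encode_decimal_sketch("\u0663.5"))    # b'3.5' (ARABIC-INDIC DIGIT THREE)
```

With this sketch, the one half character passes through untouched as the byte 0xBD, while a character at or above U+0100 that is neither a space nor a digit raises UnicodeEncodeError.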
Addendum
With this in place, you can verify it by trying a character outside Latin-1, which throws a different exception:
>>> float(u"ħ")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'decimal' codec can't encode character u'\u0127' in position 0: invalid decimal Unicode string
The ASCII encoding only includes the bytes with values <= 127. The range of characters represented by these bytes is identical in most encodings; in other words, "A" is chr(65) in ASCII, in latin-1, in UTF-8, and so on.
The one half symbol, however, is not part of the ASCII character set, so when Python tries to encode this symbol into ASCII, it can do nothing but fail.
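Both points can be checked directly. The following is a Python 3 sketch (variable names are illustrative): "A" encodes to the same single byte in all three codecs, whereas the one half symbol gets a different byte sequence in Latin-1 and UTF-8 and cannot be encoded in ASCII at all.

```python
# Python 3 sketch comparing how three codecs treat "A" and the one half symbol.
for codec in ("ascii", "latin-1", "utf-8"):
    assert "A".encode(codec) == bytes([65])   # same byte everywhere

half = "\u00bd"
latin1_bytes = half.encode("latin-1")   # one byte, equal to the code point
utf8_bytes = half.encode("utf-8")       # two bytes

try:
    half.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

print(latin1_bytes, utf8_bytes, ascii_ok)   # b'\xbd' b'\xc2\xbd' False
```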
Update: Here's what happens (I assume we're talking CPython):
float(u'\xbd') leads to PyFloat_FromString in floatobject.c being called. This function, given a unicode object, in turn calls PyUnicode_EncodeDecimal in unicodeobject.c. From skimming over the code, I gather that this function turns the unicode object into a string by replacing every character with a Unicode code point < 256 with the byte of that value, i.e. the one half character, having the code point 189, is turned into chr(189).
Then, PyFloat_FromString does its work as usual. At this moment, it's working with a regular string, which happens to contain a non-ASCII byte. It doesn't care about this; it just finds a byte that's not a digit, a period or the like, so it raises the ValueError.
The argument to this exception is a string
"invalid literal for float(): " + evil_string
That's fine; an exception message is, after all, a string. It's only when you try to decode this string, using the default encoding ASCII, that this turns into a problem.