In Python 2.7, when passing a unicode string to ElementTree's fromstring()
method that has encoding="UTF-16"
in the XML declaration, I'm getting a ParseError saying that the encoding specified is incorrect:
>>> from xml.etree import ElementTree
>>> data = u'<?xml version="1.0" encoding="utf-16"?><root/>'
>>> ElementTree.fromstring(data)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Program Files (x86)\Python 2.7\lib\xml\etree\ElementTree.py", line 1300, in XML
parser.feed(text)
File "C:\Program Files (x86)\Python 2.7\lib\xml\etree\ElementTree.py", line 1642, in feed
self._raiseerror(v)
File "C:\Program Files (x86)\Python 2.7\lib\xml\etree\ElementTree.py", line 1506, in _raiseerror
raise err
xml.etree.ElementTree.ParseError: encoding specified in XML declaration is incorrect: line 1, column 30
What does that mean? What makes ElementTree think so?
After all, I'm passing in unicode codepoints, not a byte string. There is no encoding involved here. How can it be incorrect?
Of course, one could argue that any encoding is incorrect, as these unicode codepoints are not encoded. However, then why is UTF-8 not rejected as "incorrect encoding"?
>>> ElementTree.fromstring(u'<?xml version="1.0" encoding="utf-8"?><root/>')
I can solve this problem easily either by encoding the unicode string into a UTF-16-encoded byte string and passing that to fromstring()
or by replacing encoding="utf-16"
with encoding="utf-8"
in the unicode string, but I would like to understand why that exception is raised. The documentation of ElementTree says nothing about only accepting byte strings.
Specifically, I would like to avoid these additional operations because my input data can get quite big and I would like to avoid having them twice in memory and the CPU overhead of processing them more than absolutely necessary.
UTF stands for UCS Transformation Format, and UCS itself means Universal Character Set. The number 8 or 16 refers to the number of bits used to represent a character. They are either 8(1 to 4 bytes) or 16(2 or 4 bytes). For the documents without encoding information, UTF-8 is set by default.
Unicode Transformation Format, 8-bit encoding form is designed for ease of use with existing ASCII-based systems and enables use of all the characters in the Unicode standard.
UTF-8 is the default character encoding for XML documents. Character encoding can be studied in our Character Set Tutorial. UTF-8 is also the default encoding for HTML5, CSS, JavaScript, PHP, and SQL.
I'm not going to try to justify the behavior, but to explain why it's actually happening with the code as written.
In short: the XML parser that Python uses, expat, operates on bytes, not unicode characters. You MUST call .encode('utf-16-be')
or .encode('utf-16-le')
on the string before you pass it to ElementTree.fromstring
:
ElementTree.fromstring(data.encode('utf-16-be'))
Proof: ElementTree.fromstring
eventually calls down into pyexpat.xmlparser.Parse
, which is implemented in pyexpat.c:
static PyObject *
xmlparse_Parse(xmlparseobject *self, PyObject *args)
{
char *s;
int slen;
int isFinal = 0;
if (!PyArg_ParseTuple(args, "s#|i:Parse", &s, &slen, &isFinal))
return NULL;
return get_parse_result(self, XML_Parse(self->itself, s, slen, isFinal));
}
So the unicode parameter you passed in gets converted using s#
. The docs for PyArg_ParseTuple
say:
s# (string, Unicode or any read buffer compatible object) [const char *, int (or Py_ssize_t, see below)] This variant on s stores into two C variables, the first one a pointer to a character string, the second one its length. In this case the Python string may contain embedded null bytes. Unicode objects pass back a pointer to the default encoded string version of the object if such a conversion is possible. All other read-buffer compatible objects pass back a reference to the raw internal data representation.
Let's check this out:
from xml.etree import ElementTree
data = u'<?xml version="1.0" encoding="utf-8"?><root>\u2163</root>'
print ElementTree.fromstring(data)
gives the error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2163' in position 44: ordinal not in range(128)
which means that when you were specifying encoding="utf-8"
, you were just getting lucky that there weren't non-ASCII characters in your input when the Unicode string got encoded to ASCII. If you add the following before you parse, UTF-8 works as expected with that example:
import sys
reload(sys).setdefaultencoding('utf8')
however, it doesn't work to set the defaultencoding to 'utf-16-be' or 'utf-16-le', since the Python bits of ElementTree do direct string comparisons which start to fail in UTF-16 land.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With