Why does ElementTree reject UTF-16 XML declarations with "encoding incorrect"?

In Python 2.7, when I pass a unicode string that has encoding="utf-16" in its XML declaration to ElementTree's fromstring() function, I get a ParseError saying that the specified encoding is incorrect:

>>> from xml.etree import ElementTree
>>> data = u'<?xml version="1.0" encoding="utf-16"?><root/>'
>>> ElementTree.fromstring(data)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Program Files (x86)\Python 2.7\lib\xml\etree\ElementTree.py", line 1300, in XML
    parser.feed(text)
  File "C:\Program Files (x86)\Python 2.7\lib\xml\etree\ElementTree.py", line 1642, in feed
    self._raiseerror(v)
  File "C:\Program Files (x86)\Python 2.7\lib\xml\etree\ElementTree.py", line 1506, in _raiseerror
    raise err
xml.etree.ElementTree.ParseError: encoding specified in XML declaration is incorrect: line 1, column 30

What does that mean? What makes ElementTree think so?

After all, I'm passing in unicode codepoints, not a byte string. There is no encoding involved here. How can it be incorrect?

Of course, one could argue that any encoding is incorrect, as these unicode codepoints are not encoded. However, then why is UTF-8 not rejected as "incorrect encoding"?

>>> ElementTree.fromstring(u'<?xml version="1.0" encoding="utf-8"?><root/>')

I can work around this easily, either by encoding the unicode string into a UTF-16-encoded byte string and passing that to fromstring(), or by replacing encoding="utf-16" with encoding="utf-8" in the unicode string. However, I would like to understand why that exception is raised. The documentation of ElementTree says nothing about only accepting byte strings.

Specifically, I would like to avoid these additional operations because my input data can get quite big. I don't want to hold it in memory twice or spend more CPU time processing it than absolutely necessary.

asked Jun 04 '14 19:06

Henrik Heimbuerger

1 Answer

I'm not going to try to justify the behavior, but to explain why it's actually happening with the code as written.

In short: the XML parser that Python uses, expat, operates on bytes, not unicode characters. You need to encode the string to bytes yourself, for example with .encode('utf-16-be') or .encode('utf-16-le'), before you pass it to ElementTree.fromstring:

ElementTree.fromstring(data.encode('utf-16-be'))
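As a slightly fuller sketch of the same idea, the snippet below encodes the question's string with a few UTF-16 codecs and parses each result. The `tags` dictionary is only there for illustration. It assumes the bundled expat detects the byte order either from the BOM that the plain 'utf-16' codec prepends or from the leading "<?" bytes when there is no BOM:

```python
from xml.etree import ElementTree

data = u'<?xml version="1.0" encoding="utf-16"?><root/>'

# Encode explicitly so that the bytes really are UTF-16, as the declaration claims.
# 'utf-16' prepends a BOM; the -be/-le variants do not.
tags = {}
for codec in ('utf-16', 'utf-16-be', 'utf-16-le'):
    encoded = data.encode(codec)
    tags[codec] = ElementTree.fromstring(encoded).tag

print(tags)
```

Whichever codec you pick, the point is that the bytes handed to the parser agree with the encoding named in the declaration.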

Proof: ElementTree.fromstring eventually calls down into pyexpat.xmlparser.Parse, which is implemented in pyexpat.c:

static PyObject *
xmlparse_Parse(xmlparseobject *self, PyObject *args)
{
    char *s;
    int slen;
    int isFinal = 0;

    if (!PyArg_ParseTuple(args, "s#|i:Parse", &s, &slen, &isFinal))
        return NULL;

    return get_parse_result(self, XML_Parse(self->itself, s, slen, isFinal));
}

So the unicode parameter you passed in gets converted using s#. The docs for PyArg_ParseTuple say:

s# (string, Unicode or any read buffer compatible object) [const char *, int (or Py_ssize_t, see below)] This variant on s stores into two C variables, the first one a pointer to a character string, the second one its length. In this case the Python string may contain embedded null bytes. Unicode objects pass back a pointer to the default encoded string version of the object if such a conversion is possible. All other read-buffer compatible objects pass back a reference to the raw internal data representation.
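To see what that implicit conversion does to the string from the question, here is a small sketch that performs the conversion by hand. It assumes the default encoding is 'ascii' (as it is on a stock Python 2.7) and simply calls .encode('ascii') explicitly, so it also runs on interpreters where the implicit step no longer exists:

```python
from xml.etree import ElementTree

data = u'<?xml version="1.0" encoding="utf-16"?><root/>'

# Assumption: this mimics the "s#" conversion under a default encoding of 'ascii'.
# The result is one byte per character, which is not UTF-16.
as_bytes = data.encode('ascii')

try:
    ElementTree.fromstring(as_bytes)
    message = None
except ElementTree.ParseError as err:
    # expat notices that the bytes do not match the declared encoding.
    message = str(err)

print(message)
```

The message printed here should be the same complaint about the XML declaration that appears in the question's traceback.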

Let's check this out:

from xml.etree import ElementTree
data = u'<?xml version="1.0" encoding="utf-8"?><root>\u2163</root>'
print ElementTree.fromstring(data)

gives the error:

UnicodeEncodeError: 'ascii' codec can't encode character u'\u2163' in position 44: ordinal not in range(128)

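which means that with encoding="utf-8" you were just getting lucky: the Unicode string was implicitly encoded to ASCII, and your input happened to contain no non-ASCII characters. Pure ASCII bytes are also valid UTF-8, so the declaration matched the data. With encoding="utf-16", the same one-byte-per-character data reaches expat, which sees that it is not UTF-16 and reports the declared encoding as incorrect. If you add the following before you parse, UTF-8 works as expected with that example: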

import sys
reload(sys).setdefaultencoding('utf8')

However, setting the default encoding to 'utf-16-be' or 'utf-16-le' doesn't work, since the Python parts of ElementTree do direct string comparisons that start to fail in UTF-16 land.
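If the concern is holding a second, fully encoded copy of a large document in memory, one possible approach is to encode and feed the parser in chunks, which also avoids touching the default encoding. This is only a sketch: `parse_unicode_in_chunks` and `chunk_size` are hypothetical names, the `encoding` argument is assumed to match the XML declaration, and on narrow Python 2 builds the slicing could split a surrogate pair:

```python
import codecs
from xml.etree import ElementTree


def parse_unicode_in_chunks(text, encoding, chunk_size=64 * 1024):
    """Encode `text` piece by piece and feed each piece to the parser."""
    # The incremental encoder keeps state across calls, so a codec such as
    # 'utf-16' emits its BOM only once, at the start.
    encoder = codecs.getincrementalencoder(encoding)()
    parser = ElementTree.XMLParser()
    for start in range(0, len(text), chunk_size):
        parser.feed(encoder.encode(text[start:start + chunk_size]))
    # Flush anything the encoder may still be holding.
    parser.feed(encoder.encode(u'', final=True))
    return parser.close()


text = u'<?xml version="1.0" encoding="utf-16"?><root>\u2163</root>'
root = parse_unicode_in_chunks(text, 'utf-16', chunk_size=7)
print(root.tag)
```

Only one encoded chunk exists at a time, so the extra memory is bounded by `chunk_size` rather than by the size of the whole document. The parsed tree itself still has to fit in memory.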

answered Oct 19 '22 19:10

Joe Hildebrand

Tags: encoding, unicode, python-unicode, python-2.7, elementtree
