Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python ElementTree generate not well formed XML file with special character '\x0b'

I used ElementTree to generate xml with special character of '\x0b', then use minidom to parse it. It will throw not well-formed error.

import xml.etree.ElementTree as ET
from xml.dom import minidom
root = ET.Element('root')
root.text='\x0b'
xml = ET.tostring(root, 'UTF-8')
print(xml)
pretty_tree = minidom.parseString(xml)

Generated XML: <root>\x0b</root>

Error:

Traceback (most recent call last):
  File "testXml.py", line 7, in <module>
    pretty_tree = minidom.parseString(xml)
  File "/usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/xml/dom/minidom.py", line 1968, in parseString
    return expatbuilder.parseString(string)
  File "/usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/xml/dom/expatbuilder.py", line 925, in parseString
    return builder.parseString(string)
  File "/usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/xml/dom/expatbuilder.py", line 223, in parseString
    parser.Parse(string, True)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 6
like image 517
Jianwu Chen Avatar asked Mar 03 '23 09:03

Jianwu Chen


2 Answers

This behaviour has been raised as a bug in the past and resolved as "won't fix".

The author of the ElementTree module commented

For ET, [this behaviour is] very much on purpose. Validating data provided by every single application would kill performance for all of them, even if only a small minority would ever try to serialize data that cannot be represented in XML.

The closing comment (by the maintainer of lxml, who is also a Python core dev) includes these observations:

This is a tricky decision. lxml, for example, validates user input, but that's because it has to process it anyway and does it along the way directly on input (and very efficiently in C code). ET, on the other hand, is rather lenient about what it allows users to do and doesn't apply much processing to user input. It even allows invalid trees during processing and only expects the tree to be serialisable when requested to serialise it.

I think that's a fair behaviour, because most user input will be ok and shouldn't need to suffer the performance penalty of validating all input. Null-characters are a very rare thing to find in text, for example, and I think it's reasonable to let users handle the few cases by themselves where they can occur.

...

In the end, users who really care about correct output should run some kind of schema validation over it after serialisation, as that would detect not only data issues but also structural and logical issues (such as a missing or empty attribute), specifically for their target data format. In some cases, it might even detect random data corruption due to old non-ECC RAM in the server machine. :)

...

So in summary, ET.tostring will generate xml which is not well-formed, and this is by design. If necessary, the output can be parsed to check that it is well-formed, using ET.fromstring or another parser. Alternatively, lxml can be used instead of ElementTree.

like image 67
snakecharmerb Avatar answered Mar 05 '23 16:03

snakecharmerb


\x0b is an XML restricted character. There is a good description of valid and restricted characters in the answers to this question.

like image 42
Rusty Widebottom Avatar answered Mar 05 '23 16:03

Rusty Widebottom