I am parsing a huge XML file that contains many empty elements, such as
<MemoryEnv></MemoryEnv>
When serializing with
etree.tostring(root_element, pretty_print=True)
the output element is collapsed to
<MemoryEnv/>
Is there any way to prevent this? etree.tostring() does not seem to provide such an option.
Is there a way to interfere with lxml's tostring() serializer?
Btw, the html module does not work. It's not designed for XML, and it does not serialize empty elements in their original form.
The problem is that, although the collapsed and uncollapsed forms of an empty element are equivalent, the program that parses this file won't work with collapsed empty elements.
Serializing with the canonical XML method (c14n) works with lxml: it does not collapse empty elements.
>>> from lxml import etree
>>> s = "<MemoryEnv></MemoryEnv>"
>>> root_element = etree.XML(s)
>>> etree.tostring(root_element, method="c14n")
b'<MemoryEnv></MemoryEnv>'
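One thing worth checking before relying on this: canonical XML normalizes more than empty elements. The sketch below uses a made-up sample document (the attributes are only there for illustration) to compare the default and c14n serializations. Canonicalization also reorders attributes, so it is worth confirming that the consuming program tolerates that.

```python
from lxml import etree

# Hypothetical sample; the attribute order is deliberately non-alphabetical.
root = etree.XML('<root><MemoryEnv z="1" a="2"></MemoryEnv><foo>bar</foo></root>')

default_out = etree.tostring(root)
canonical_out = etree.tostring(root, method="c14n")

print(default_out)    # the empty element is collapsed to the short form
print(canonical_out)  # the empty element keeps explicit start and end tags
```

If the attribute reordering is a problem, the text-based approach in the next answer leaves the rest of the serialization untouched.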
Here is a way to do it. Ensure that the text value of every empty element is an empty string rather than None.
Example:
from lxml import etree
XML = """
<root>
<MemoryEnv></MemoryEnv>
<AlsoEmpty></AlsoEmpty>
<foo>bar</foo>
</root>"""
doc = etree.fromstring(XML)
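for elem in doc.iter():
    # A childless element whose text is None is serialized as <tag/>;
    # an empty string forces <tag></tag>.
    if elem.text is None:
        elem.text = ''
print(etree.tostring(doc, encoding="unicode"))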
Output:
<root>
<MemoryEnv></MemoryEnv>
<AlsoEmpty></AlsoEmpty>
<foo>bar</foo>
</root>
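The loop above can be wrapped in a small reusable function. This is only a sketch: `expand_empty_elements` is a hypothetical helper name, not part of lxml. It touches only elements without children, on the assumption that adding an empty text node to a parent element is unnecessary and might interfere with pretty_print indentation.

```python
from lxml import etree


def expand_empty_elements(root):
    """Give every childless element without text an empty-string text.

    Hypothetical helper, not an lxml API.
    """
    for elem in root.iter():
        # Comments and processing instructions have a non-string tag; skip them.
        if not isinstance(elem.tag, str):
            continue
        if len(elem) == 0 and elem.text is None:
            elem.text = ''
    return root


doc = etree.fromstring(
    "<root><MemoryEnv/><foo>bar</foo><parent><child/></parent></root>"
)
result = etree.tostring(expand_empty_elements(doc), encoding="unicode")
print(result)
```

For a very large document this still walks the whole tree once before serializing, which is usually acceptable when the file is already parsed into memory.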
An alternative is to use the write_c14n() method to write canonical XML (which does not use the special empty-element syntax) to a file.
from lxml import etree
XML = """
<root>
<MemoryEnv></MemoryEnv>
<AlsoEmpty></AlsoEmpty>
<foo>bar</foo>
</root>"""
doc = etree.fromstring(XML)
doc.getroottree().write_c14n("out.xml")
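To confirm what actually ends up on disk, the file can be read back after writing. The following is a minimal sketch that writes to a temporary directory (an assumption made here so that nothing is left behind; the original simply writes out.xml to the current directory).

```python
import os
import tempfile

from lxml import etree

doc = etree.fromstring("<root><MemoryEnv></MemoryEnv><foo>bar</foo></root>")

# Write to a temporary directory so the sketch leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "out.xml")
    doc.getroottree().write_c14n(out_path)
    with open(out_path, "rb") as fh:
        written = fh.read()

print(written)  # the empty element is written with explicit start and end tags
```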