I'm using Python's lxml to read an XML document, modify it, and write it back, but the original DOCTYPE and XML declaration disappear. Is there an easy way to put them back, whether through lxml or some other solution?
Parsing from strings and files. lxml.etree supports parsing XML in a number of ways and from all important sources, namely strings, files, URLs (http/ftp) and file-like objects. The main parse functions are fromstring() and parse(), both called with the source as first argument.
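As a minimal sketch of the two entry points named above (the sample document and variable names are made up for illustration): fromstring() returns the root Element, while parse() returns an ElementTree that wraps the whole document.

```python
from io import BytesIO
from lxml import etree

# Hypothetical sample document, used only for illustration
data = b"<root><child>text</child></root>"

# fromstring() takes the document itself and returns the root Element
root = etree.fromstring(data)

# parse() takes a filename, URL, or file-like object and returns an ElementTree
tree = etree.parse(BytesIO(data))

print(root.tag)                  # tag of the root Element
print(tree.getroot().tag)        # same root, reached through the ElementTree
print(tree.docinfo.xml_version)  # document-level info lives on the ElementTree
```

The distinction matters for this question because the DOCTYPE and declaration belong to the document, not to the root element.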
lxml is a Python library which allows for easy handling of XML and HTML files, and can also be used for web scraping. There are a lot of off-the-shelf XML parsers out there, but for better results, developers sometimes prefer to write their own XML and HTML parsers.
In lxml.objectify, this directly translates to enforcing a specific object tree, i.e. expected object attributes are ensured to be there and to have the expected type. This can easily be achieved through XML Schema validation at parse time.
tl;dr
# adds declaration with version and encoding regardless of
# which attributes were present in the original declaration
# expects utf-8 encoding (encode/decode calls)
# depending on your needs you might want to improve that
from lxml import etree
from xml.dom.minidom import parseString
xml1 = '''\
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE root SYSTEM "example.dtd">
<root>...</root>
'''
xml2 = '''\
<root>...</root>
'''
def has_xml_declaration(xml):
    # minidom leaves version as None when the input has no XML declaration
    return parseString(xml).version
def process(xml):
    t = etree.fromstring(xml.encode()).getroottree()
    if has_xml_declaration(xml):
        print(etree.tostring(t, xml_declaration=True, encoding=t.docinfo.encoding).decode())
    else:
        print(etree.tostring(t).decode())
process(xml1)
process(xml2)
The following will include the DOCTYPE and the XML declaration:
from io import BytesIO
from lxml import etree
# bytes input, because the XML declaration names an encoding
tree = etree.parse(BytesIO(b'''<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE root SYSTEM "test" [ <!ENTITY tasty "eggs"> ]>
<root>
<a>&tasty;</a>
</root>
'''))
docinfo = tree.docinfo
# tostring() returns bytes when given an encoding name, so decode for printing
print(etree.tostring(tree, xml_declaration=True, encoding=docinfo.encoding).decode(docinfo.encoding))
Note: tostring does not preserve the DOCTYPE if you create an Element (e.g. using fromstring); it only works when you process the XML using parse.
Update: as pointed out by J.F. Sebastian, my assertion about fromstring is not true.
Here is some code to highlight the differences between Element and ElementTree serialization:
from io import BytesIO
from lxml import etree
# bytes input, because the XML declaration names an encoding
xml_bytes = b'''<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE root SYSTEM "test" [ <!ENTITY tasty "eggs"> ]>
<root>
<a>&tasty;</a>
</root>
'''
# get the ElementTree using parse
parse_tree = etree.parse(BytesIO(xml_bytes))
encoding = parse_tree.docinfo.encoding
result = etree.tostring(parse_tree, xml_declaration=True, encoding=encoding)
print("%s\nparse ElementTree:\n%s\n" % ('-'*20, result.decode(encoding)))
# get the ElementTree using fromstring
fromstring_tree = etree.fromstring(xml_bytes).getroottree()
encoding = fromstring_tree.docinfo.encoding
result = etree.tostring(fromstring_tree, xml_declaration=True, encoding=encoding)
print("%s\nfromstring ElementTree:\n%s\n" % ('-'*20, result.decode(encoding)))
# DOCTYPE is lost, and no access to encoding
fromstring_element = etree.fromstring(xml_bytes)
result = etree.tostring(fromstring_element, xml_declaration=True)
print("%s\nfromstring Element:\n%s\n" % ('-'*20, result.decode('ascii')))
and the output is:
--------------------
parse ElementTree:
<?xml version='1.0' encoding='iso-8859-1'?>
<!DOCTYPE root SYSTEM "test" [
<!ENTITY tasty "eggs">
]>
<root>
<a>eggs</a>
</root>
--------------------
fromstring ElementTree:
<?xml version='1.0' encoding='iso-8859-1'?>
<!DOCTYPE root SYSTEM "test" [
<!ENTITY tasty "eggs">
]>
<root>
<a>eggs</a>
</root>
--------------------
fromstring Element:
<?xml version='1.0' encoding='ASCII'?>
<root>
<a>eggs</a>
</root>
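The DOCTYPE and declaration details used above are exposed on the tree's docinfo object, which suggests one way to put a DOCTYPE back when serializing a bare Element. The following is a sketch rather than a definitive recipe: it assumes the docinfo.doctype string is good enough (it omits the internal subset, so the ENTITY declaration is not reproduced) and passes it through the doctype argument of tostring(). The variable names are illustrative.

```python
from lxml import etree

xml_bytes = b'''<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE root SYSTEM "test" [ <!ENTITY tasty "eggs"> ]>
<root>
<a>&tasty;</a>
</root>
'''

element = etree.fromstring(xml_bytes)
docinfo = element.getroottree().docinfo

# Document-level details recorded by the parser
print(docinfo.xml_version, docinfo.encoding)
print(docinfo.root_name, docinfo.system_url)
print(docinfo.doctype)

# Serialize the bare Element, supplying the DOCTYPE and encoding explicitly
result = etree.tostring(
    element,
    xml_declaration=True,
    encoding=docinfo.encoding,
    doctype=docinfo.doctype,
)
print(result.decode(docinfo.encoding))
```

Serializing the ElementTree itself, as in the first two cases above, remains the simpler route when the tree is available.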
You can also preserve the DOCTYPE and the XML declaration with fromstring():
import sys
from lxml import etree
# bytes input, because the XML declaration names an encoding
xml = br'''<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
<title>example</title>
</head>
<body>
<p>This is an example</p>
</body>
</html>'''
tree = etree.fromstring(xml).getroottree() # or etree.parse(file)
# write() emits bytes here, so target the binary stdout buffer
tree.write(sys.stdout.buffer, xml_declaration=True, encoding=tree.docinfo.encoding)
Output:
<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<title>example</title>
</head>
<body>
<p>This is an example</p>
</body>
</html>
Note that the XML declaration (with the correct encoding) and the DOCTYPE are present. It even (possibly incorrectly) uses ' instead of " in the XML declaration and adds Content-Type to the <head>.
For @John Keyes' example input, it produces the same results as etree.tostring() in that answer.
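Putting the answers together for the original read, modify, write workflow, a minimal sketch might look like the following. The helper name rewrite_xml, the modify callback, and the temporary file are hypothetical, introduced only for this example; it assumes the file is parsed with etree.parse() so that docinfo is available when writing back.

```python
import os
import tempfile
from lxml import etree

def rewrite_xml(path, modify):
    """Parse path, apply modify(root), and write the document back in place."""
    tree = etree.parse(path)
    modify(tree.getroot())
    # Writing the ElementTree (not the root Element) keeps the DOCTYPE;
    # the declaration is requested explicitly with the original encoding.
    tree.write(path, xml_declaration=True, encoding=tree.docinfo.encoding)

def set_text(root):
    # Example modification: replace the root element's text
    root.text = "changed"

# Demonstration on a temporary file
source = (b'<?xml version="1.0" encoding="UTF-8"?>\n'
          b'<!DOCTYPE root SYSTEM "example.dtd">\n'
          b'<root>...</root>\n')
with tempfile.NamedTemporaryFile(suffix=".xml", delete=False) as handle:
    handle.write(source)
    path = handle.name

rewrite_xml(path, set_text)
with open(path, "rb") as handle:
    output = handle.read()
os.remove(path)
print(output.decode("utf-8"))
```

The declaration is regenerated rather than copied, so details such as the quote style may differ from the original file.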