Preserving original doctype and declaration of an lxml.etree parsed xml

Tags: python, doctype, lxml, xml-declaration

I'm using Python's lxml to read an XML document, modify it, and write it back, but the original doctype and XML declaration disappear. Is there an easy way to put them back, whether through lxml or some other solution?

asked Oct 19 '12 02:10

incognito2

2 Answers

tl;dr

# Adds a declaration with version and encoding, regardless of
# which attributes were present in the original declaration.
# Expects UTF-8 input (see the encode/decode calls);
# depending on your needs you might want to improve that.
from lxml import etree
from xml.dom.minidom import parseString

xml1 = '''\
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE root SYSTEM "example.dtd">
<root>...</root>
'''
xml2 = '''\
<root>...</root>
'''


def has_xml_declaration(xml):
    # minidom leaves Document.version as None when there is no declaration
    return parseString(xml).version is not None


def process(xml):
    # fromstring() returns the root Element; getroottree() recovers the
    # ElementTree, which still carries the docinfo (doctype, encoding)
    t = etree.fromstring(xml.encode()).getroottree()
    if has_xml_declaration(xml):
        print(etree.tostring(t, xml_declaration=True, encoding=t.docinfo.encoding).decode())
    else:
        print(etree.tostring(t).decode())


process(xml1)
process(xml2)
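
If pulling in xml.dom.minidom only to detect the declaration feels heavy, a prefix check on the raw bytes is one possible alternative. This is just a sketch under stated assumptions: roundtrip is a hypothetical helper name, the input is assumed to be bytes, and no byte-order mark is assumed to precede the declaration.

```python
from lxml import etree


def roundtrip(data: bytes) -> bytes:
    """Re-serialize XML, emitting a declaration only if the input had one."""
    # An XML declaration, when present, must be the very first thing in the
    # document, so a prefix check is enough (assumption: no BOM in front).
    had_declaration = data.startswith(b"<?xml")
    tree = etree.fromstring(data).getroottree()
    # Fall back to UTF-8 in case docinfo reports no encoding.
    encoding = tree.docinfo.encoding or "UTF-8"
    return etree.tostring(tree, xml_declaration=had_declaration, encoding=encoding)


with_decl = b'<?xml version="1.0" encoding="UTF-8"?>\n<root>...</root>'
without_decl = b"<root>...</root>"

print(roundtrip(with_decl).decode())
print(roundtrip(without_decl).decode())
```

Working on bytes end to end also avoids the encode/decode calls that tie the snippet above to UTF-8.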

The following will include the DOCTYPE and the XML declaration:

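from io import BytesIO
from lxml import etree

# Parse from bytes so that the declared encoding (iso-8859-1) is what the parser sees
tree = etree.parse(BytesIO(b'''<?xml version="1.0" encoding="iso-8859-1"?>
 <!DOCTYPE root SYSTEM "test" [ <!ENTITY tasty "eggs"> ]>
  <root>
   <a>&tasty;</a>
 </root>
'''))

docinfo = tree.docinfo
# tostring() returns bytes when an encoding is given, so decode before printing
print(etree.tostring(tree, xml_declaration=True, encoding=docinfo.encoding).decode(docinfo.encoding))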

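Note: tostring does not preserve the DOCTYPE when you serialize a bare Element (e.g. the return value of fromstring). It is preserved when you serialize the ElementTree.

Update: as J.F. Sebastian pointed out, my original assertion that this only works with parse was not true: fromstring(...).getroottree() gives you an ElementTree that keeps the DOCTYPE as well.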

Here is some code to highlight the differences between Element and ElementTree serialization:

from io import BytesIO
from lxml import etree

# bytes, not str: fromstring() raises ValueError for a str that carries
# an encoding declaration
xml_bytes = b'''<?xml version="1.0" encoding="iso-8859-1"?>
 <!DOCTYPE root SYSTEM "test" [ <!ENTITY tasty "eggs"> ]>
  <root>
   <a>&tasty;</a>
 </root>
'''

# get the ElementTree using parse
parse_tree = etree.parse(BytesIO(xml_bytes))
encoding = parse_tree.docinfo.encoding
result = etree.tostring(parse_tree, xml_declaration=True, encoding=encoding)
print("%s\nparse ElementTree:\n%s\n" % ('-'*20, result.decode(encoding)))

# get the ElementTree using fromstring
fromstring_tree = etree.fromstring(xml_bytes).getroottree()
encoding = fromstring_tree.docinfo.encoding
result = etree.tostring(fromstring_tree, xml_declaration=True, encoding=encoding)
print("%s\nfromstring ElementTree:\n%s\n" % ('-'*20, result.decode(encoding)))

# DOCTYPE is lost, and no access to encoding
fromstring_element = etree.fromstring(xml_bytes)
result = etree.tostring(fromstring_element, xml_declaration=True)
print("%s\nfromstring Element:\n%s\n" % ('-'*20, result.decode('ascii')))

and the output is:

--------------------
parse ElementTree:
<?xml version='1.0' encoding='iso-8859-1'?>
<!DOCTYPE root SYSTEM "test" [
<!ENTITY tasty "eggs">
]>
<root>
   <a>eggs</a>
 </root>

--------------------
fromstring ElementTree:
<?xml version='1.0' encoding='iso-8859-1'?>
<!DOCTYPE root SYSTEM "test" [
<!ENTITY tasty "eggs">
]>
<root>
   <a>eggs</a>
 </root>

--------------------
fromstring Element:
<?xml version='1.0' encoding='ASCII'?>
<root>
   <a>eggs</a>
 </root>

answered Nov 12 '22 16:11

John Keyes

You can also preserve DOCTYPE and the XML declaration with fromstring():

import sys
from lxml import etree

# bytes, not str: fromstring() raises ValueError for a str that carries
# an encoding declaration
xml = b'''<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
 <head>
 <title>example</title>
 </head>
 <body>
 <p>This is an example</p>
 </body>
</html>'''

tree = etree.fromstring(xml).getroottree() # or etree.parse(file)
# write() emits bytes when an encoding is given, so use the binary stdout stream
tree.write(sys.stdout.buffer, xml_declaration=True, encoding=tree.docinfo.encoding)

Output

<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
 <head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
 <title>example</title>
 </head>
 <body>
 <p>This is an example</p>
 </body>
</html>

Note that the XML declaration (with the correct encoding) and the doctype are both present. The serializer uses ' instead of " in the XML declaration (both quote styles are valid XML), and for this XHTML document it also adds a lang attribute and a Content-Type <meta> to the <head>.

For @John Keyes' example input, it produces the same result as etree.tostring() in his answer.
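
If you only have the root Element at serialization time, one option is to read the pieces from docinfo and hand the DOCTYPE back to tostring() through its doctype keyword. This is a sketch under assumptions: the document and the "example.dtd" system identifier are made up for illustration, and the DTD file does not need to exist because the default parser does not load it.

```python
from lxml import etree

xml = b'''<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE root SYSTEM "example.dtd">
<root>...</root>
'''

root = etree.fromstring(xml)
docinfo = root.getroottree().docinfo

# docinfo exposes what the parser saw in the prolog
print(docinfo.xml_version)
print(docinfo.encoding)
print(docinfo.doctype)
print(docinfo.system_url)

# Serializing the bare Element would drop the DOCTYPE,
# so pass it back explicitly via the doctype keyword.
serialized = etree.tostring(
    root,
    xml_declaration=True,
    encoding=docinfo.encoding,
    doctype=docinfo.doctype,
)
print(serialized.decode(docinfo.encoding))
```

Passing doctype explicitly is also a way to substitute a different DOCTYPE than the one in the source document.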

answered Nov 12 '22 18:11

jfs
