Can lxml be used to check whether XML is well formed, or is it too powerful? For example, it seems to be able to parse XML even when it is not well formed. What's the easiest way to check whether an XML file is well formed?
lxml should throw an exception when parsing XML that is not well formed. For example:
from lxml import etree
xml = """
<multipleroot>
<noclosingtag>
</multipleroot>
<multipleroot></multipleroot>"""
doc = etree.fromstring(xml)
The exception thrown:
Traceback (most recent call last):
File "D:\StackOverflow\Python\Q50.py", line 8, in <module>
doc = etree.fromstring(xml)
......
......
XMLSyntaxError: Opening and ending tag mismatch: noclosingtag line 3 and multipleroot, line 4, column 16
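Building on that behaviour, here is a minimal sketch of a well-formedness check: parse with the default (non-recovering) parser and treat XMLSyntaxError as "not well formed". The helper name is_well_formed is hypothetical and not part of lxml.

```python
from lxml import etree


def is_well_formed(xml_text):
    """Return True if xml_text parses with lxml's default, strict XML parser.

    Hypothetical helper for illustration; not part of lxml itself.
    """
    try:
        etree.fromstring(xml_text)
    except etree.XMLSyntaxError:
        # The strict parser rejects input that is not well formed.
        return False
    return True


broken = """
<multipleroot>
<noclosingtag>
</multipleroot>
<multipleroot></multipleroot>"""

print(is_well_formed(broken))                   # False
print(is_well_formed("<root><child/></root>"))  # True
```

For a file on disk, the same idea should work with etree.parse(path) in place of etree.fromstring, since it raises the same exception for malformed input.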
However, if you explicitly tell XMLParser to recover from XML that is not well formed (recover=True), or you use HTMLParser instead, lxml may still be able to parse the input:
from lxml import etree
xml = """
<multipleroot>
<noclosingtag>
</multipleroot>
<multipleroot></multipleroot>"""
parser = etree.XMLParser(recover=True)
#parser = etree.HTMLParser()
doc = etree.fromstring(xml, parser=parser)
print(etree.tostring(doc).decode())
This successfully prints the parsed XML:
<multipleroot>
<noclosingtag>
</noclosingtag>
<multipleroot/></multipleroot>
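Because a recovering parser does not raise, a successful parse alone does not tell you whether the input was well formed. One possible sketch, assuming the parser's error_log records the problems it repaired during the last parse, is to inspect that log afterwards. The helper name recovered_errors is hypothetical.

```python
from lxml import etree


def recovered_errors(xml_text):
    """Parse with recover=True and return the messages lxml logged.

    Hypothetical helper for illustration; an empty list suggests the
    parser did not have to repair anything.
    """
    parser = etree.XMLParser(recover=True)
    etree.fromstring(xml_text, parser=parser)
    # error_log holds the entries from this parser's most recent run.
    return [entry.message for entry in parser.error_log]


broken = """
<multipleroot>
<noclosingtag>
</multipleroot>
<multipleroot></multipleroot>"""

for message in recovered_errors(broken):
    print(message)
```

If the goal is only a yes/no answer, parsing with the default strict parser and catching XMLSyntaxError is likely the simpler route; the log is mainly useful when you want to keep the recovered tree and still report what was wrong.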