I am trying to parse XML data with Python. The data uses namespace prefixes, but not every file declares the prefix it uses. Example XML:
<?xml version="1.0" encoding="UTF-8"?>
<item subtype="bla">
<thing>Word</thing>
<abc:thing2>Another Word</abc:thing2>
</item>
I have been using xml.etree.ElementTree to parse these files, but whenever the prefix is not properly declared, ElementTree throws a parse error (unbound prefix, right at the start of <abc:thing2>).
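For reference, the failure can be reproduced with the standard library alone. This is only a minimal sketch of the situation described above; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# Same document as above: the prefix "abc" is used but never declared.
xml = """<?xml version="1.0" encoding="UTF-8"?>
<item subtype="bla">
<thing>Word</thing>
<abc:thing2>Another Word</abc:thing2>
</item>"""

try:
    ET.fromstring(xml)
    error = None
except ET.ParseError as exc:
    # ParseError carries a message and a (line, column) position.
    error = exc

print(error)           # the message mentions "unbound prefix"
print(error.position)  # (line, column) where the parser gave up
```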
Searching for this error leads me to solutions that suggest I fix the namespace declaration. However, I do not control the XML that I need to work with, so modifying the input files is not a viable option.
Searching for namespace parsing in general leads me to many questions about searching in a namespace-agnostic way, which is not what I need.
I am looking for some way to automatically parse these files, even if the namespace declaration is broken. I have thought about doing the following:
- ... register_namespace, but that does not seem to work.

UPDATE:
After Har07 put me on the path of lxml, I tried to see if this would let me perform the different solutions I had thought of, and what the result would be:
- ... xmlns declarations, and then handing it off to lxml.etree's fromstring method. Unfortunately, that also requires removing every reference to the encoding declaration from the string. It works, though.
- ... lxml (through attribute_defaults, dtd_validation, or load_dtd), but unfortunately does not solve the namespace issue.
- ... lxml not to bother about namespaces: possible through the recover option. Unfortunately, that also ignores other ways in which the XML may be broken (see Har07's answer for details).

One possible way is to use an ElementTree-compatible library, lxml. For example:
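from lxml import etree as ElementTree

# Bytes input: on Python 3, lxml rejects str input that carries an encoding declaration.
xml = b"""<?xml version="1.0" encoding="UTF-8"?>
<item subtype="bla">
<thing>Word</thing>
<abc:thing2>Another Word</abc:thing2>
</item>"""
# recover=True lets the parser continue past errors such as the unbound prefix.
parser = ElementTree.XMLParser(recover=True)
tree = ElementTree.fromstring(xml, parser)
# Select the first <thing> element with XPath and print its serialized form.
thing = tree.xpath("//thing")[0]
print(ElementTree.tostring(thing))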
All you need to do to parse XML that is not well-formed with lxml is pass recover=True to the XMLParser constructor. lxml also has full support for XPath 1.0, which is very useful when you need to select part of an XML document using more complex criteria.
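Because recover=True suppresses errors instead of raising them, it can help to check what the parser actually recovered from. The following is a minimal sketch, assuming lxml is installed: the parser's error_log attribute collects the problems found during the parse, so unexpected breakage can still be detected. The helper name parse_leniently is made up for illustration.

```python
from lxml import etree


def parse_leniently(data: bytes):
    """Parse with recover=True and return (root, list of logged problems)."""
    # A fresh parser per call keeps the logs of different documents apart.
    parser = etree.XMLParser(recover=True)
    root = etree.fromstring(data, parser)
    # Each log entry exposes line, type_name and message attributes.
    problems = [(e.line, e.type_name, e.message) for e in parser.error_log]
    return root, problems


xml = b"""<?xml version="1.0" encoding="UTF-8"?>
<item subtype="bla">
<thing>Word</thing>
<abc:thing2>Another Word</abc:thing2>
</item>"""

root, problems = parse_leniently(xml)
for line, kind, message in problems:
    print(f"line {line}: {kind}: {message}")
```

A caller could then accept a document only if every logged problem is of a kind it expects (for example, the undefined namespace prefix) and reject it otherwise.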
UPDATE:
I don't know all the types of XML error that the recover=True option can tolerate. But here is another type of error that I know of besides an unbound namespace prefix: an unclosed tag. lxml will fix, rather than ignore, an unclosed tag by adding the corresponding closing tag automatically. For example, given the following broken XML:
xml = """<item subtype="bla">
<thing>Word</thing>
<bad>
<abc:thing2>Another Word</abc:thing2>
</item>"""
parser = ElementTree.XMLParser(recover=True)
tree = ElementTree.fromstring(xml, parser)
print(ElementTree.tostring(tree))
The final output XML after parsing with lxml is as follows:
<item subtype="bla">
<thing>Word</thing>
<bad>
<abc:thing2>Another Word</abc:thing2>
</bad></item>
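The pre-processing idea from the question's update (injecting xmlns declarations before parsing) might look roughly like the sketch below. It uses only the standard library. The function name declare_missing_prefixes and the namespace URI are assumptions made for illustration, and the text matching is deliberately naive: it assumes the root start tag is the first "<" followed by a name character, and that an existing declaration is written exactly as xmlns:prefix= with no extra whitespace.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical prefix-to-URI mapping; the real URIs depend on the data source.
KNOWN_NAMESPACES = {"abc": "http://example.com/abc"}


def declare_missing_prefixes(data: bytes, namespaces: dict) -> bytes:
    """Add xmlns declarations to the root element for undeclared known prefixes."""
    # Locate the root start tag: the first "<" followed by a name character,
    # which skips the XML declaration, comments and a DOCTYPE.
    match = re.search(rb"<[A-Za-z_][^\s/>]*", data)
    if match is None:
        return data
    extra = b""
    for prefix, uri in namespaces.items():
        # Naive check: only an exact "xmlns:prefix=" counts as declared.
        if b"xmlns:" + prefix.encode() + b"=" not in data:
            extra += b' xmlns:%s="%s"' % (prefix.encode(), uri.encode())
    # Insert the declarations directly after the root element's name.
    return data[:match.end()] + extra + data[match.end():]


xml = b"""<?xml version="1.0" encoding="UTF-8"?>
<item subtype="bla">
<thing>Word</thing>
<abc:thing2>Another Word</abc:thing2>
</item>"""

fixed = declare_missing_prefixes(xml, KNOWN_NAMESPACES)
root = ET.fromstring(fixed)
print(root.find("abc:thing2", KNOWN_NAMESPACES).text)  # Another Word
```

Working on bytes rather than str leaves the encoding declaration in place. Handing the same bytes to lxml.etree.fromstring should therefore not require stripping the declaration, since lxml accepts bytes input that carries one.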