Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Parsing XML with undeclared prefixes in Python

I am trying to parse XML data with Python that uses prefixes, but not every file has the declaration of the prefix. Example XML:

<?xml version="1.0" encoding="UTF-8"?>
<item subtype="bla">
    <thing>Word</thing>
    <abc:thing2>Another Word</abc:thing2>
</item>

I have been using xml.etree.ElementTree to parse these files, but whenever the prefix is not properly declared, ElementTree throws a parse error. (unbound prefix, right at the start of <abc:thing2>) Searching for this error leads me to solutions that suggest I fix the namespace declaration. However, I do not control the XML that I need to work with, so modifying the input files is not a viable option.

Searching for namespace parsing in general leads me to many questions about searching in namespace-agnostic way, which is not what I need.

I am looking for some way to automatically parse these files, even if the namespace declaration is broken. I have thought about doing the following:

  • tell ElementTree what namespaces to expect beforehand, because I do know which ones can occur. I found register_namespace, but that does not seem to work.
  • have the full DTD read in before parsing, and see if that solves it. I could not find a way to do this with ElementTree.
  • tell ElementTree to not bother about namespaces at all. It should not cause issues with my data, but I found no way to do this
  • use some other parsing library that can handle this issue - though I prefer not to need installation of extra libraries. I have difficulty seeing from the documentation if any others would be able to solve my issue.
  • some other route that I am currently not seeing?

UPDATE: After Har07 put me on the path of lxml, I tried to see if this would let me perform the different solutions I had thought of, and what the result would be:

  • telling the parser what namespaces to expect beforehand: I still could not find any 'official' way to do this, but in my searches before I had found the suggestion to simply add the requisite declaration to the data programmatically. (for a different programming situation - unfortunately I can't find the link anymore) It seemed terribly hacky to me, but I tried it anyway. It involves loading the data as a string, changing the enclosing element to have the right xmlns declarations, and then handing it off to lxml.etree's fromstring method. Unfortunately, that also requires removing all reference to encoding declaration from the string. It works, though.
  • Read in the DTD before parsing: it is possible with lxml (through attribute_defaults, dtd_validation, or load_dtd), but unfortunately does not solve the namespace issue.
  • Telling lxml not to bother about namespaces: possible through the recover option. Unfortunately, that also ignores other ways in which the XML may be broken (see Har07's answer for details)
like image 648
Anique Avatar asked Jun 02 '15 13:06

Anique


1 Answers

One possible way is using ElementTree compatible library, lxml. For example :

from lxml import etree as ElementTree

xml = """<?xml version="1.0" encoding="UTF-8"?>
<item subtype="bla">
    <thing>Word</thing>
    <abc:thing2>Another Word</abc:thing2>
</item>"""
parser = ElementTree.XMLParser(recover=True)
tree = ElementTree.fromstring(xml, parser)

thing = tree.xpath("//thing")[0]
print(ElementTree.tostring(thing))

All you need to do for parsing a non well-formed XML using lxml is passing parameter recover=True to constructor of XMLParser. lxml also has full support for xpath 1.0 which is very useful when you need to get part of XML document using more complex criteria.

UPDATE :

I don't know all the types of XML error that recover=True option can tolerate. But here is another type of error that I know besides unbound namespace prefix: unclosed tag. lxml will fix -rather than ignore- unclosed tag by adding corresponding closing tag automatically. For example, given the following broken XML :

xml = """<item subtype="bla">
    <thing>Word</thing>
    <bad>
    <abc:thing2>Another Word</abc:thing2>
</item>"""
parser = ElementTree.XMLParser(recover=True)
tree = ElementTree.fromstring(xml, parser)

print(ElementTree.tostring(tree))

The final output XML after parsed by lxml is as follow :

<item subtype="bla">
    <thing>Word</thing>
    <bad>
    <abc:thing2>Another Word</abc:thing2>
</bad></item>
like image 195
har07 Avatar answered Sep 27 '22 17:09

har07