Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

python reporting line/column of origin of XML node

Tags:

python

dom

xml

sax

I'm currently using xml.dom.minidom to parse some XML in python. After parsing, I'm doing some reporting on the content, and would like to report the line (and column) where the tag started in the source XML document, but I don't see how that's possible.

I'd like to stick with xml.dom / xml.dom.minidom if possible, but if I need to use a SAX parser to get the origin info, I can do that -- ideal in that case would be using SAX to track node location, but still end up with a DOM for my post-processing.

Any suggestions on how to do this? Hopefully I'm just overlooking something in the docs and this extremely easy.

like image 602
Jeremy Slade Avatar asked Jan 25 '11 01:01

Jeremy Slade


1 Answers

By monkeypatching the minidom content handler I was able to record line and column number for each node (as the 'parse_position' attribute). It's a little dirty, but I couldn't see any "officially sanctioned" way of doing it :) Here's my test script:

from xml.dom import minidom
import xml.sax

doc = """\
<File>
  <name>Name</name>
  <pos>./</pos>
</File>
"""


def set_content_handler(dom_handler):
    def startElementNS(name, tagName, attrs):
        orig_start_cb(name, tagName, attrs)
        cur_elem = dom_handler.elementStack[-1]
        cur_elem.parse_position = (
            parser._parser.CurrentLineNumber,
            parser._parser.CurrentColumnNumber
        )

    orig_start_cb = dom_handler.startElementNS
    dom_handler.startElementNS = startElementNS
    orig_set_content_handler(dom_handler)

parser = xml.sax.make_parser()
orig_set_content_handler = parser.setContentHandler
parser.setContentHandler = set_content_handler

dom = minidom.parseString(doc, parser)
pos = dom.firstChild.parse_position
print("Parent: '{0}' at {1}:{2}".format(
    dom.firstChild.localName, pos[0], pos[1]))
for child in dom.firstChild.childNodes:
    if child.localName is None:
        continue
    pos = child.parse_position
    print "Child: '{0}' at {1}:{2}".format(child.localName, pos[0], pos[1])

It outputs the following:

Parent: 'File' at 1:0
Child: 'name' at 2:2
Child: 'pos' at 3:2
like image 90
aknuds1 Avatar answered Oct 23 '22 10:10

aknuds1