Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

XML Parsing with Python and minidom

I'm using Python (minidom) to parse an XML file that prints a hierarchical structure that looks something like this (indentation is used here to show the significant hierarchical relationship):

My Document
Overview
    Basic Features
    About This Software
        Platforms Supported

Instead, the program iterates multiple times over the nodes and produces the following, printing duplicate nodes. (Looking at the node list at each iteration, it's obvious why it does this but I can't seem to find a way to get the node list I'm looking for.)

My Document
Overview
Basic Features
About This Software
Platforms Supported
Basic Features
About This Software
Platforms Supported
Platforms Supported

Here is the XML source file:

<?xml version="1.0" encoding="UTF-8"?>
<DOCMAP>
    <Topic Target="ALL">
        <Title>My Document</Title>
    </Topic>
    <Topic Target="ALL">
        <Title>Overview</Title>
        <Topic Target="ALL">
            <Title>Basic Features</Title>
        </Topic>
        <Topic Target="ALL">
            <Title>About This Software</Title>
            <Topic Target="ALL">
                <Title>Platforms Supported</Title>
            </Topic>
        </Topic>
    </Topic>
</DOCMAP>

Here is the Python program:

import xml.dom.minidom
from xml.dom.minidom import Node

dom = xml.dom.minidom.parse("test.xml")
Topic=dom.getElementsByTagName('Topic')
i = 0
for node in Topic:
    alist=node.getElementsByTagName('Title')
    for a in alist:
        Title= a.firstChild.data
        print Title

I could fix the problem by not nesting 'Topic' elements, by changing the lower level topic names to something like 'SubTopic1' and 'SubTopic2'. But, I want to take advantage of built-in XML hierarchical structuring without needing different element names; it seems that I should be able to nest 'Topic' elements and that there should be some way to know which level 'Topic' I'm currently looking at.

I've tried a number of different XPath functions without much success.

like image 299
hWorks Avatar asked Oct 20 '09 19:10

hWorks


People also ask

What is XML Minidom?

xml. dom. minidom is a minimal implementation of the Document Object Model interface, with an API similar to that in other languages. It is intended to be simpler than the full DOM and also significantly smaller. Users who are not already proficient with the DOM should consider using the xml.

How do I parse XML in Python?

There are two ways to parse the file using 'ElementTree' module. The first is by using the parse() function and the second is fromstring() function. The parse () function parses XML document which is supplied as a file whereas, fromstring parses XML when supplied as a string i.e within triple quotes.


1 Answers

getElementsByTagName is recursive, you'll get all descendents with a matching tagName. Because your Topics contain other Topics that also have Titles, the call will get the lower-down Titles many times.

If you want to ask for all matching direct children only, and you don't have XPath available, you can write a simple filter, eg.:

def getChildrenByTagName(node, tagName):
    for child in node.childNodes:
        if child.nodeType==child.ELEMENT_NODE and (tagName=='*' or child.tagName==tagName):
            yield child

for topic in document.getElementsByTagName('Topic'):
    title= list(getChildrenByTagName('Title'))[0]         # or just get(...).next()
    print title.firstChild.data
like image 136
bobince Avatar answered Sep 19 '22 23:09

bobince