XML Parsing with Python and minidom

Tags: python, xml, minidom

I'm using Python (minidom) to parse an XML file and print a hierarchical structure that should look something like this (indentation is used here to show the significant hierarchical relationship):

My Document
Overview
    Basic Features
    About This Software
        Platforms Supported

Instead, the program iterates multiple times over the nodes and produces the following, printing duplicate nodes. (Looking at the node list at each iteration, it's obvious why it does this but I can't seem to find a way to get the node list I'm looking for.)

My Document
Overview
Basic Features
About This Software
Platforms Supported
Basic Features
About This Software
Platforms Supported
Platforms Supported

Here is the XML source file:

<?xml version="1.0" encoding="UTF-8"?>
<DOCMAP>
    <Topic Target="ALL">
        <Title>My Document</Title>
    </Topic>
    <Topic Target="ALL">
        <Title>Overview</Title>
        <Topic Target="ALL">
            <Title>Basic Features</Title>
        </Topic>
        <Topic Target="ALL">
            <Title>About This Software</Title>
            <Topic Target="ALL">
                <Title>Platforms Supported</Title>
            </Topic>
        </Topic>
    </Topic>
</DOCMAP>

Here is the Python program:

import xml.dom.minidom

dom = xml.dom.minidom.parse("test.xml")
topics = dom.getElementsByTagName('Topic')
for node in topics:
    titles = node.getElementsByTagName('Title')
    for title in titles:
        print(title.firstChild.data)

I could fix the problem by not nesting 'Topic' elements and instead renaming the lower-level topics to something like 'SubTopic1' and 'SubTopic2'. But I want to take advantage of XML's built-in hierarchical structure without needing different element names; it seems that I should be able to nest 'Topic' elements and that there should be some way to know which level of 'Topic' I'm currently looking at.

I've tried a number of different XPath functions without much success.

299

asked Oct 20 '09 19:10

hWorks

1 Answer

getElementsByTagName is recursive: you'll get all descendants with a matching tagName, at any depth. Because your Topics contain other Topics that also have Titles, the call returns the lower-level Titles multiple times, once for each enclosing Topic.
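To make the difference concrete, here is a small sketch comparing descendants with direct children. It assumes a trimmed-down copy of the question's XML inlined as a string (the `XML` constant and the variable names are only for illustration).

```python
import xml.dom.minidom

# Trimmed-down copy of the question's XML, inlined so the sketch is self-contained.
XML = """<DOCMAP>
<Topic><Title>Overview</Title>
  <Topic><Title>Basic Features</Title></Topic>
  <Topic><Title>About This Software</Title>
    <Topic><Title>Platforms Supported</Title></Topic>
  </Topic>
</Topic>
</DOCMAP>"""

dom = xml.dom.minidom.parseString(XML)
outer = dom.getElementsByTagName('Topic')[0]   # the "Overview" Topic

# Every Title below the outer Topic, at any depth.
descendant_titles = [t.firstChild.data
                     for t in outer.getElementsByTagName('Title')]

# Only the Title elements that are direct children of the outer Topic.
child_titles = [c.firstChild.data for c in outer.childNodes
                if c.nodeType == c.ELEMENT_NODE and c.tagName == 'Title']

print(descendant_titles)
print(child_titles)
```

The first list contains all four titles, while the second contains only the outer Topic's own title.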

If you want only the matching direct children, and you don't have XPath available, you can write a simple filter, e.g.:

def getChildrenByTagName(node, tagName):
    # Yield only the direct child elements of node whose tag matches ('*' matches any).
    for child in node.childNodes:
        if child.nodeType == child.ELEMENT_NODE and (tagName == '*' or child.tagName == tagName):
            yield child

# dom is the parsed document from the question
for topic in dom.getElementsByTagName('Topic'):
    title = list(getChildrenByTagName(topic, 'Title'))[0]   # or just next(getChildrenByTagName(...))
    print(title.firstChild.data)
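A direct-children filter can also recover the nesting level the question asks about, by recursing into child Topics and tracking the depth. The following is a minimal sketch rather than part of the original answer: the helper names `child_elements` and `walk_topics` are hypothetical, the four-space indent is an arbitrary choice, and the question's XML is inlined (without the XML declaration) instead of being read from test.xml.

```python
import xml.dom.minidom

def child_elements(node, tag_name):
    """Yield the direct child elements of node that have the given tag name."""
    for child in node.childNodes:
        if child.nodeType == child.ELEMENT_NODE and child.tagName == tag_name:
            yield child

def walk_topics(node, depth=0):
    """Yield (depth, title) for each Topic under node, in document order."""
    for topic in child_elements(node, 'Topic'):
        title = next(child_elements(topic, 'Title'), None)
        if title is not None and title.firstChild is not None:
            yield depth, title.firstChild.data
        # Recurse into Topics nested directly inside this one.
        for item in walk_topics(topic, depth + 1):
            yield item

XML = """<DOCMAP>
    <Topic Target="ALL">
        <Title>My Document</Title>
    </Topic>
    <Topic Target="ALL">
        <Title>Overview</Title>
        <Topic Target="ALL">
            <Title>Basic Features</Title>
        </Topic>
        <Topic Target="ALL">
            <Title>About This Software</Title>
            <Topic Target="ALL">
                <Title>Platforms Supported</Title>
            </Topic>
        </Topic>
    </Topic>
</DOCMAP>"""

dom = xml.dom.minidom.parseString(XML)
lines = ['    ' * depth + title
         for depth, title in walk_topics(dom.documentElement)]
print('\n'.join(lines))
```

Because each Topic is visited exactly once, from its parent, no title is printed twice, and the depth value tells you which level of Topic you are looking at.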

136

answered Sep 19 '22 23:09

bobince
