How do I get Python XML to stop having wasted Child Nodes

Tags:

python

xml

whitespace

nodes

I have a simple XML document I'm trying to read in with Python DOM (see below):

XML File:

<?xml version="1.0" encoding="utf-8"?>
<HeaderLookup>
    <Header>
        <Reserved>2</Reserved>
        <CPU>1</CPU>
        <Flag>1</Flag>
        <VQI>12</VQI>
        <Group_ID>16</Group_ID>
        <DI>2</DI>
        <DE>1</DE>
        <ACOSS>5</ACOSS>
        <RGH>8</RGH>
    </Header>
</HeaderLookup>

Python Code:

from xml.dom import minidom

# The with-statement closes the file even if parsing fails.
with open("test.xml") as xml_file:
    xmlroot = minidom.parse(xml_file).documentElement

for item in xmlroot.getElementsByTagName("Header")[0].childNodes:
    print(item)

Result:

<DOM Text node "u'\n\t\t'">
<DOM Element: Reserved at 0x28d2828>
<DOM Text node "u'\n\t\t'">
<DOM Element: CPU at 0x28d28c8>
<DOM Text node "u'\n\t\t'">
<DOM Element: Flag at 0x28d2968>
<DOM Text node "u'\n\t\t'">
<DOM Element: VQI at 0x28d2a08>
<DOM Text node "u'\n\t\t'">
<DOM Element: Group_ID at 0x28d2ad0>
<DOM Text node "u'\n\t\t'">
<DOM Element: DI at 0x28d2b70>
<DOM Text node "u'\n\t\t'">
<DOM Element: DE at 0x28d2c10>
<DOM Text node "u'\n\t\t'">
<DOM Element: ACOSS at 0x28d2cb0>
<DOM Text node "u'\n\t\t'">
<DOM Element: RGH at 0x28d2d50>
<DOM Text node "u'\n\t'">

I expected 9 child nodes (Reserved, CPU, Flag, VQI, Group_ID, DI, DE, ACOSS, and RGH), but for some reason I get a list of 19 nodes, 10 of which are whitespace text nodes (why is whitespace even considered a node in the first place?!). Can anyone tell me if there's a way to get the XML parser to not include whitespace nodes?

asked Jun 10 '11 20:06

Dasmowenator

1 Answer

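Whitespace is significant in XML. Without a DTD the parser cannot tell indentation from real content, so the DOM reports every run of it as a text node. If you only care about the elements, check out ElementTree, which has a different API for processing XML than the DOM.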

Example

from xml.etree import ElementTree as et

data = '''\
<?xml version="1.0" encoding="utf-8"?>
<HeaderLookup>
    <Header>
        <Reserved>2</Reserved>
        <CPU>1</CPU>
        <Flag>1</Flag>
        <VQI>12</VQI>
        <Group_ID>16</Group_ID>
        <DI>2</DI>
        <DE>1</DE>
        <ACOSS>5</ACOSS>
        <RGH>8</RGH>
    </Header>
</HeaderLookup>
'''

tree = et.fromstring(data)
for n in tree.find('Header'):
    print(n.tag, '=', n.text)

Output

Reserved = 2
CPU = 1
Flag = 1
VQI = 12
Group_ID = 16
DI = 2
DE = 1
ACOSS = 5
RGH = 8
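
To confirm that iterating an element yields only the nine child elements the question expected, here is a minimal sketch. It assumes the same document as above; the names `xml_text`, `child_tags` and `fields` are chosen here for illustration and are not part of the original answer.

```python
from xml.etree import ElementTree as ET

# Same document as in the question; xml_text is an illustrative name.
xml_text = '''\
<?xml version="1.0" encoding="utf-8"?>
<HeaderLookup>
    <Header>
        <Reserved>2</Reserved>
        <CPU>1</CPU>
        <Flag>1</Flag>
        <VQI>12</VQI>
        <Group_ID>16</Group_ID>
        <DI>2</DI>
        <DE>1</DE>
        <ACOSS>5</ACOSS>
        <RGH>8</RGH>
    </Header>
</HeaderLookup>
'''

root = ET.fromstring(xml_text)
header = root.find('Header')

# Iterating an Element (or calling len() on it) only sees child elements,
# never the whitespace between them.
child_tags = [child.tag for child in header]

# Assumption: every field in this sample holds an integer.
fields = {child.tag: int(child.text) for child in header}

print(len(header))
print(child_tags)
print(fields['Group_ID'])
```

Collecting the values into a dictionary is just one convenient shape for this data; a list of (tag, text) pairs would work equally well if the order of the fields matters.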

Example (extending previous code)

The whitespace is still present, but it is stored in the .tail attributes. An element's tail is the text that follows its end tag (between the end of that element and the next tag), while its text is the text between its start tag and its first child or end tag.

def dump(e):
    # Recursively print each element with its text and tail.
    print('<%s>' % e.tag)
    print('text =', repr(e.text))
    for n in e:
        dump(n)
    print('</%s>' % e.tag)
    print('tail =', repr(e.tail))

dump(tree)

Output

<HeaderLookup>
text = '\n    '
<Header>
text = '\n        '
<Reserved>
text = '2'
</Reserved>
tail = '\n        '
<CPU>
text = '1'
</CPU>
tail = '\n        '
<Flag>
text = '1'
</Flag>
tail = '\n        '
<VQI>
text = '12'
</VQI>
tail = '\n        '
<Group_ID>
text = '16'
</Group_ID>
tail = '\n        '
<DI>
text = '2'
</DI>
tail = '\n        '
<DE>
text = '1'
</DE>
tail = '\n        '
<ACOSS>
text = '5'
</ACOSS>
tail = '\n        '
<RGH>
text = '8'
</RGH>
tail = '\n    '
</Header>
tail = '\n'
</HeaderLookup>
tail = None
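
If switching away from the DOM is not an option, one possible workaround is to keep minidom and filter the child list by node type. The sketch below assumes the same document as in the question; `element_children` is a hypothetical helper written for this example, not a minidom function.

```python
from xml.dom import Node, minidom

# Same document as in the question; xml_text is an illustrative name.
xml_text = '''\
<?xml version="1.0" encoding="utf-8"?>
<HeaderLookup>
    <Header>
        <Reserved>2</Reserved>
        <CPU>1</CPU>
        <Flag>1</Flag>
        <VQI>12</VQI>
        <Group_ID>16</Group_ID>
        <DI>2</DI>
        <DE>1</DE>
        <ACOSS>5</ACOSS>
        <RGH>8</RGH>
    </Header>
</HeaderLookup>
'''


def element_children(node):
    """Hypothetical helper: return only the children that are elements."""
    return [child for child in node.childNodes
            if child.nodeType == Node.ELEMENT_NODE]


document = minidom.parseString(xml_text)
header = document.getElementsByTagName("Header")[0]

all_children = header.childNodes      # elements plus whitespace text nodes
elements = element_children(header)   # elements only

print(len(all_children), len(elements))
for element in elements:
    # Each field holds a single text node, so firstChild.data is its value.
    print(element.tagName, '=', element.firstChild.data)
```

This leaves the parsed tree untouched and only skips the whitespace text nodes while iterating, so serializing the document later still reproduces the original indentation.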

answered Oct 13 '22 16:10

Mark Tolonen
