I have a simple XML document I'm trying to read in with Python DOM (see below):
XML File:
<?xml version="1.0" encoding="utf-8"?>
<HeaderLookup>
<Header>
<Reserved>2</Reserved>
<CPU>1</CPU>
<Flag>1</Flag>
<VQI>12</VQI>
<Group_ID>16</Group_ID>
<DI>2</DI>
<DE>1</DE>
<ACOSS>5</ACOSS>
<RGH>8</RGH>
</Header>
</HeaderLookup>
Python Code:
from xml.dom import minidom
xml_file = open("test.xml")
xmlroot = minidom.parse(xml_file).documentElement
xml_file.close()
for item in xmlroot.getElementsByTagName("Header")[0].childNodes:
    print(item)
Result:
<DOM Text node "u'\n\t\t'">
<DOM Element: Reserved at 0x28d2828>
<DOM Text node "u'\n\t\t'">
<DOM Element: CPU at 0x28d28c8>
<DOM Text node "u'\n\t\t'">
<DOM Element: Flag at 0x28d2968>
<DOM Text node "u'\n\t\t'">
<DOM Element: VQI at 0x28d2a08>
<DOM Text node "u'\n\t\t'">
<DOM Element: Group_ID at 0x28d2ad0>
<DOM Text node "u'\n\t\t'">
<DOM Element: DI at 0x28d2b70>
<DOM Text node "u'\n\t\t'">
<DOM Element: DE at 0x28d2c10>
<DOM Text node "u'\n\t\t'">
<DOM Element: ACOSS at 0x28d2cb0>
<DOM Text node "u'\n\t\t'">
<DOM Element: RGH at 0x28d2d50>
<DOM Text node "u'\n\t'">
The result should be 9 Child Nodes (Reserved, CPU, Flag, VQI, Group_ID, DI, DE, ACOSS, and RGH), but for some reason it is returning a list of 19 nodes with 10 of them being whitespace (why is this even being considered a node in the first place?!). Can anyone tell me if there's a way to get the XML parser to not include whitespace nodes?
The ElementTree module offers two ways to parse XML: parse() and fromstring(). parse() reads a document from a file (a filename or a file object) and returns an ElementTree, whereas fromstring() parses XML supplied as a string and returns the root Element directly.
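As a rough illustration of the difference, the sketch below parses the same small document both ways. The sample XML is a shortened stand-in for the file in the question, and io.StringIO is used only so that parse() has a file-like object to read without touching the disk.

```python
import io
from xml.etree import ElementTree as et

# Shortened stand-in for the document in the question (an assumption for this sketch).
xml_text = "<HeaderLookup><Header><CPU>1</CPU></Header></HeaderLookup>"

# fromstring() takes the XML text itself and returns the root Element.
root_from_string = et.fromstring(xml_text)

# parse() takes a filename or a file object and returns an ElementTree;
# getroot() then gives the root Element.
root_from_file = et.parse(io.StringIO(xml_text)).getroot()

print(root_from_string.tag)                    # HeaderLookup
print(root_from_file.find("Header/CPU").text)  # 1
```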
The childNodes property is read-only and contains a node list of all of a node's children, for node types that can have children. In the DOM, the whitespace between tags counts as text, so it appears in this list as Text nodes alongside the Element nodes.
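If you want to stay with minidom, one possible approach is to filter childNodes by node type rather than changing the parser. This is only a sketch: the inline XML is a shortened version of the document in the question, and it assumes each child element contains a single text value.

```python
from xml.dom import minidom

# Shortened version of the question's document, with indentation between tags.
xml_text = """<HeaderLookup>
    <Header>
        <Reserved>2</Reserved>
        <CPU>1</CPU>
    </Header>
</HeaderLookup>"""

header = minidom.parseString(xml_text).getElementsByTagName("Header")[0]

# Whitespace between tags is reported as TEXT_NODE children,
# so keep only the children whose nodeType is ELEMENT_NODE.
elements = [n for n in header.childNodes if n.nodeType == n.ELEMENT_NODE]

for el in elements:
    # firstChild is the Text node holding the element's value.
    print(el.tagName, el.firstChild.data)
```

Here header.childNodes still contains five nodes (three whitespace Text nodes and two Elements), while elements contains only Reserved and CPU.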
ElementTree is a module in Python's standard library for parsing and navigating XML documents. It represents the document as a tree of elements that is easy to work with.
Whitespace is significant in XML, but check out ElementTree, which has a different API for processing XML than the DOM.
from xml.etree import ElementTree as et
data = '''\
<?xml version="1.0" encoding="utf-8"?>
<HeaderLookup>
<Header>
<Reserved>2</Reserved>
<CPU>1</CPU>
<Flag>1</Flag>
<VQI>12</VQI>
<Group_ID>16</Group_ID>
<DI>2</DI>
<DE>1</DE>
<ACOSS>5</ACOSS>
<RGH>8</RGH>
</Header>
</HeaderLookup>
'''
tree = et.fromstring(data)
for n in tree.find('Header'):
    print('%s = %s' % (n.tag, n.text))
Reserved = 2
CPU = 1
Flag = 1
VQI = 12
Group_ID = 16
DI = 2
DE = 1
ACOSS = 5
RGH = 8
The whitespace is still present, but it is stored in the .text and .tail attributes rather than in separate nodes. tail is the text that follows an element's end tag (between the end of one element and the start of the next), while text is the text between an element's start tag and its first child or, if it has no children, its end tag.
def dump(e):
    # Print the element's tag, its text, its children, then its tail.
    print('<%s>' % e.tag)
    print('text = %r' % (e.text,))
    for n in e:
        dump(n)
    print('</%s>' % e.tag)
    print('tail = %r' % (e.tail,))

dump(tree)
<HeaderLookup>
text = '\n '
<Header>
text = '\n '
<Reserved>
text = '2'
</Reserved>
tail = '\n '
<CPU>
text = '1'
</CPU>
tail = '\n '
<Flag>
text = '1'
</Flag>
tail = '\n '
<VQI>
text = '12'
</VQI>
tail = '\n '
<Group_ID>
text = '16'
</Group_ID>
tail = '\n '
<DI>
text = '2'
</DI>
tail = '\n '
<DE>
text = '1'
</DE>
tail = '\n '
<ACOSS>
text = '5'
</ACOSS>
tail = '\n '
<RGH>
text = '8'
</RGH>
tail = '\n '
</Header>
tail = '\n'
</HeaderLookup>
tail = None
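If the leftover whitespace in text and tail gets in the way, it can be cleared after parsing. The following is a minimal sketch, not a built-in ElementTree feature: strip_whitespace is a hypothetical helper name, and it assumes that any text or tail consisting only of whitespace is indentation you are willing to discard.

```python
from xml.etree import ElementTree as et

def strip_whitespace(root):
    """Set whitespace-only .text and .tail values to None, in place."""
    for el in root.iter():
        if el.text is not None and not el.text.strip():
            el.text = None
        if el.tail is not None and not el.tail.strip():
            el.tail = None
    return root

# Small indented sample (an assumption for this sketch).
root = strip_whitespace(et.fromstring(
    "<HeaderLookup>\n  <Header>\n    <CPU>1</CPU>\n  </Header>\n</HeaderLookup>"
))

print(et.tostring(root, encoding="unicode"))
# <HeaderLookup><Header><CPU>1</CPU></Header></HeaderLookup>
```

Real values such as the '1' in CPU are left untouched, because only strings that are empty after strip() are cleared.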