I'm able to get the value in the image tag (see XML below), but not the Category tag. The difference is one is a CDATA section and the other is just a string. Any help would be appreciated.
from xml.dom import minidom
xml = """<?xml version="1.0" ?>
<ProductData>
<ITEM Id="0471195">
<Category>
<![CDATA[Homogenizers]]>
</Category>
<Image>
0471195.jpg
</Image>
</ITEM>
<ITEM Id="0471195">
<Category>
<![CDATA[Homogenizers]]>
</Category>
<Image>
0471196.jpg
</Image>
</ITEM>
</ProductData>
"""
bad_xml_item_count = 0
data = {}
xml_data = minidom.parseString(xml).getElementsByTagName('ProductData')
parts = xml_data[0].getElementsByTagName('ITEM')
for p in parts:
try:
part_id = p.attributes['Id'].value.strip()
except(KeyError):
bad_xml_item_count += 1
continue
if not part_id:
bad_xml_item_count += 1
continue
part_image = p.getElementsByTagName('Image')[0].firstChild.nodeValue.strip()
part_category = p.getElementsByTagName('Category')[0].firstChild.data.strip()
print '\t'.join([part_id, part_category, part_image])
A CDATA section is used to mark a section of an XML document, so that the XML parser interprets it only as character data, and not as markup. It comes handy when one XML data need to be embedded within another XML document.
A CDATA section begins with the character sequence <! [CDATA[ and ends with the character sequence ]]>. Between the two character sequences, an XML processor ignores all markup characters such as <, >, and &. The only markup an XML pro-cessor recognizes inside a CDATA section is the closing character sequence ]>.
To access the JavaScript DOM, we will be using a Python package called “JyServer.” You can simply install it with pip like this. Of course, you will also need Flask installed, which you can install like this.
The CDATA Section interface is used within XML for including extended portions of text. This text is unescaped text, like < and & symbols. These symbols do not want to escape. It is used like this: <![
p.getElementsByTagName('Category')[0].firstChild
minidom does not flatten away <![CDATA[ sections to plain text, it leaves them as DOM CDATASection nodes. (Arguably it should, at least optionally. DOM Level 3 LS defaults to flattening them, for what it's worth, but minidom is much older than DOM L3.)
So the firstChild of Category is a Text node representing the whitespace between the <Category> open tag and the start of the CDATA section. It has two siblings: the CDATASection node, and another trailing whitespace Text node.
What you probably want is the textual data of all children of Category. In DOM Level 3 Core you'd just call:
p.getElementsByTagName('Category')[0].textContent
but minidom doesn't support that yet. Recent versions do, however, support another Level 3 method you can use to do the same thing in a more roundabout way:
p.getElementsByTagName('Category')[0].firstChild.wholeText
CDATA is its own node, so the Category elements here actually have three children, a whitespace text node, the CDATA node, and another whitespace node. You're just looking at the wrong one, is all. I don't see any more obvious way to query for the CDATA node, but you can pull it out like this:
[n for n in category.childNodes if n.nodeType==category.CDATA_SECTION_NODE][0]
I've ran into a similar problem. My solution was similar to what ironfroggy answered, but implemented in a more general fashion:
for node in parentNode.childNodes:
if node.nodeType == 4:
cdataContent = node.data.strip()
CDATA's node type is 4 (CDATA_SECTION_NODE
)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With