
xml.dom.minidom: Getting CDATA values

Tags: python, xml

I'm able to get the value of the Image tag (see the XML below), but not the Category tag. The difference is that one contains a CDATA section and the other is just a plain string. Any help would be appreciated.

from xml.dom import minidom

xml = """<?xml version="1.0" ?>
<ProductData>
    <ITEM Id="0471195">
        <Category>
            <![CDATA[Homogenizers]]>        
        </Category>
        <Image>
            0471195.jpg
        </Image>
    </ITEM>
    <ITEM Id="0471195">
        <Category>
            <![CDATA[Homogenizers]]>        
        </Category>
        <Image>
            0471196.jpg
        </Image>
    </ITEM>
</ProductData>
"""

bad_xml_item_count = 0
data = {}
xml_data = minidom.parseString(xml).getElementsByTagName('ProductData')
parts = xml_data[0].getElementsByTagName('ITEM')
for p in parts:
    try:
        part_id = p.attributes['Id'].value.strip()
    except KeyError:
        bad_xml_item_count += 1
        continue
    if not part_id:
        bad_xml_item_count += 1
        continue
    part_image = p.getElementsByTagName('Image')[0].firstChild.nodeValue.strip()
    part_category = p.getElementsByTagName('Category')[0].firstChild.data.strip()
    print('\t'.join([part_id, part_category, part_image]))
Jason Coon asked Feb 27 '09 23:02




3 Answers

p.getElementsByTagName('Category')[0].firstChild

minidom does not flatten <![CDATA[ sections away to plain text; it leaves them as DOM CDATASection nodes. (Arguably it should flatten them, at least optionally. DOM Level 3 LS defaults to flattening them, for what it's worth, but minidom is much older than DOM L3.)

So the firstChild of Category is a Text node representing the whitespace between the <Category> open tag and the start of the CDATA section. It has two siblings: the CDATASection node, and another trailing whitespace Text node.
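To see those three children for yourself, here is a minimal sketch. The trimmed-down sample document and the variable names are assumptions for illustration, not the asker's full script:

```python
from xml.dom import minidom

# Trimmed-down sample document (an assumption for illustration).
doc = minidom.parseString(
    "<ITEM><Category>\n    <![CDATA[Homogenizers]]>\n</Category></ITEM>"
)
category = doc.getElementsByTagName("Category")[0]

# Collect (nodeType, data) for each direct child of <Category>.
# Node type 3 is TEXT_NODE and 4 is CDATA_SECTION_NODE.
children = [(n.nodeType, n.data) for n in category.childNodes]
for node_type, data in children:
    print(node_type, repr(data))
```

The whitespace Text nodes on either side of the CDATASection are why firstChild.data comes back as nothing useful once stripped.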

What you probably want is the textual data of all children of Category. In DOM Level 3 Core you'd just call:

p.getElementsByTagName('Category')[0].textContent

but minidom doesn't support that yet. Recent versions do, however, support another Level 3 property, wholeText, which you can use to do the same thing in a more roundabout way (the result keeps the surrounding whitespace, so strip it as needed):

p.getElementsByTagName('Category')[0].firstChild.wholeText
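As a rough, self-contained illustration of the wholeText approach (the sample document and the name part_category are assumptions carried over from the question, cut down for brevity):

```python
from xml.dom import minidom

# Trimmed-down sample document (an assumption for illustration).
doc = minidom.parseString(
    "<ITEM><Category>\n    <![CDATA[Homogenizers]]>\n</Category></ITEM>"
)
category = doc.getElementsByTagName("Category")[0]

# wholeText joins the first Text node with its adjacent Text and
# CDATASection siblings, so the surrounding whitespace comes along
# and still has to be stripped.
part_category = category.firstChild.wholeText.strip()
print(part_category)
```

Note that this sketch assumes the element has at least one child; for an empty element firstChild is None and the attribute access would fail.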
bobince answered Oct 08 '22 21:10


CDATA is its own node, so the Category elements here actually have three children: a whitespace text node, the CDATA node, and another whitespace text node. You're just looking at the wrong one, is all. I don't see any more obvious way to query for the CDATA node, but you can pull it out like this:

[n for n in category.childNodes if n.nodeType==category.CDATA_SECTION_NODE][0]
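The one-liner above raises IndexError when an element has no CDATA child. A small sketch of the same idea wrapped in a helper that returns None instead; the function name first_cdata and the sample document are hypothetical, made up for this example:

```python
from xml.dom import minidom, Node


def first_cdata(element):
    """Return the data of the first CDATA child of element, or None."""
    for node in element.childNodes:
        if node.nodeType == Node.CDATA_SECTION_NODE:
            return node.data
    return None


# Trimmed-down sample document (an assumption for illustration).
doc = minidom.parseString(
    "<ITEM>"
    "<Category>\n  <![CDATA[Homogenizers]]>\n</Category>"
    "<Image>\n  0471195.jpg\n</Image>"
    "</ITEM>"
)
category = doc.getElementsByTagName("Category")[0]
image = doc.getElementsByTagName("Image")[0]

print(first_cdata(category))  # the CDATA payload
print(first_cdata(image))     # no CDATA child here
```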
ironfroggy answered Oct 08 '22 19:10


I've run into a similar problem. My solution was similar to what ironfroggy answered, but implemented in a more general fashion:

for node in parentNode.childNodes:
    if node.nodeType == 4:
        cdataContent = node.data.strip()

CDATA's node type is 4 (CDATA_SECTION_NODE).
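Going one step further than the loop above, one possible variation is to treat Text and CDATA children alike, so the same call works for both the Image and the Category tags from the question. This is a different technique from filtering on type 4 alone, and the helper name direct_text and the TEXT_LIKE tuple are hypothetical names chosen for this sketch:

```python
from xml.dom import minidom, Node

# Node types whose .data should count as text content.
TEXT_LIKE = (Node.TEXT_NODE, Node.CDATA_SECTION_NODE)


def direct_text(element):
    """Join the data of all direct Text and CDATA children, stripped."""
    return "".join(
        n.data for n in element.childNodes if n.nodeType in TEXT_LIKE
    ).strip()


# Trimmed-down sample document (an assumption for illustration).
doc = minidom.parseString(
    '<ITEM Id="0471195">'
    "<Category>\n  <![CDATA[Homogenizers]]>\n</Category>"
    "<Image>\n  0471195.jpg\n</Image>"
    "</ITEM>"
)
item = doc.documentElement
row = [
    item.getAttribute("Id"),
    direct_text(item.getElementsByTagName("Category")[0]),
    direct_text(item.getElementsByTagName("Image")[0]),
]
print("\t".join(row))
```

Using the named constants from xml.dom.Node rather than the literal 4 is only a readability choice; the values are the same.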

BBog answered Oct 08 '22 20:10