xml.dom.minidom: Getting CDATA values

Tags:

python

xml

I'm able to get the value in the Image tag (see XML below), but not the one in the Category tag. The difference is that one is a CDATA section and the other is just a string. Any help would be appreciated.

from xml.dom import minidom

xml = """<?xml version="1.0" ?>
<ProductData>
    <ITEM Id="0471195">
        <Category>
            <![CDATA[Homogenizers]]>        
        </Category>
        <Image>
            0471195.jpg
        </Image>
    </ITEM>
    <ITEM Id="0471195">
        <Category>
            <![CDATA[Homogenizers]]>        
        </Category>
        <Image>
            0471196.jpg
        </Image>
    </ITEM>
</ProductData>
"""

bad_xml_item_count = 0
data = {}
xml_data = minidom.parseString(xml).getElementsByTagName('ProductData')
parts = xml_data[0].getElementsByTagName('ITEM')
for p in parts:
    try:
        part_id = p.attributes['Id'].value.strip()
    except KeyError:
        bad_xml_item_count += 1
        continue
    if not part_id:
        bad_xml_item_count += 1
        continue
    part_image = p.getElementsByTagName('Image')[0].firstChild.nodeValue.strip()
    part_category = p.getElementsByTagName('Category')[0].firstChild.data.strip()
    print('\t'.join([part_id, part_category, part_image]))

524

asked Feb 27 '09 23:02

Jason Coon

3 Answers

p.getElementsByTagName('Category')[0].firstChild

minidom does not flatten <![CDATA[ sections to plain text; it leaves them as DOM CDATASection nodes. (Arguably it should flatten them, at least optionally. DOM Level 3 LS defaults to flattening them, for what it's worth, but minidom is much older than DOM L3.)

So the firstChild of Category is a Text node representing the whitespace between the <Category> open tag and the start of the CDATA section. It has two siblings: the CDATASection node, and another trailing whitespace Text node.
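To see this for yourself, here is a small sketch that lists the children of a Category element. The trimmed-down XML string and the `names` lookup table are made up for illustration; only the minidom calls come from the standard library.

```python
from xml.dom import minidom

# A cut-down version of the XML from the question (illustrative only).
doc = minidom.parseString(
    "<ITEM><Category>\n  <![CDATA[Homogenizers]]>\n</Category></ITEM>"
)
category = doc.getElementsByTagName("Category")[0]

# Map the numeric node types to readable names for the printout.
names = {
    category.TEXT_NODE: "Text",
    category.CDATA_SECTION_NODE: "CDATASection",
}

# Each child is either whitespace text or the CDATA section itself.
children = [(names[n.nodeType], n.data) for n in category.childNodes]
for kind, data in children:
    print(kind, repr(data))
```

The printout should show a Text node, then the CDATASection node, then another Text node, which is why `firstChild.data.strip()` in the question comes back empty.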

What you probably want is the textual data of all children of Category. In DOM Level 3 Core you'd just call:

p.getElementsByTagName('Category')[0].textContent

but minidom doesn't support that yet. Recent versions do, however, support another Level 3 method you can use to do the same thing in a more roundabout way:

p.getElementsByTagName('Category')[0].firstChild.wholeText
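As a rough sketch of how that plays out, assuming a Python version whose minidom has `wholeText` and an element that has at least one child: `wholeText` joins the whitespace Text node with its adjacent Text and CDATASection siblings, so a final `strip()` leaves just the CDATA content. The one-element XML string here is made up for illustration.

```python
from xml.dom import minidom

# Illustrative document: whitespace, a CDATA section, more whitespace.
doc = minidom.parseString(
    "<Category>\n    <![CDATA[Homogenizers]]>\n</Category>"
)
category = doc.documentElement

# firstChild is the leading whitespace Text node; wholeText concatenates it
# with the neighbouring CDATASection and trailing whitespace Text node.
text = category.firstChild.wholeText.strip()
print(text)
```

Note that `firstChild` is `None` for an empty element, so real code would probably want to check for that before touching `wholeText`.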

145

answered Oct 08 '22 21:10

bobince

CDATA is its own node, so the Category elements here actually have three children: a whitespace text node, the CDATA node, and another whitespace text node. You're just looking at the wrong one, is all. I don't see any more obvious way to query for the CDATA node, but you can pull it out like this:

[n for n in category.childNodes if n.nodeType==category.CDATA_SECTION_NODE][0]
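One thing to watch with the `[0]` above is that it raises `IndexError` when an element has no CDATA child. A possible variation, with a hypothetical helper name `first_cdata` and made-up sample XML, uses `next()` with a default instead:

```python
from xml.dom import minidom


def first_cdata(element, default=None):
    """Return the data of the first CDATA child of element, or default."""
    return next(
        (n.data for n in element.childNodes
         if n.nodeType == n.CDATA_SECTION_NODE),
        default,
    )


# Illustrative input: Category holds CDATA, Image holds plain text.
doc = minidom.parseString(
    "<ITEM><Category> <![CDATA[Homogenizers]]> </Category>"
    "<Image> 0471195.jpg </Image></ITEM>"
)
item = doc.documentElement
category = item.getElementsByTagName("Category")[0]
image = item.getElementsByTagName("Image")[0]

print(first_cdata(category))  # the CDATA content
print(first_cdata(image))     # no CDATA child, so the default (None)
```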

answered Oct 08 '22 19:10

ironfroggy

I've run into a similar problem. My solution was similar to what ironfroggy answered, but implemented in a more general fashion:

for node in parentNode.childNodes:
    if node.nodeType == 4:
        cdataContent = node.data.strip()

CDATA's node type is 4 (CDATA_SECTION_NODE)
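If the magic number bothers you, the same loop can be written against the named constant from `xml.dom.Node`. The sketch below also collects every CDATA child rather than keeping only the last one seen; the function name `cdata_sections` and the sample XML are assumptions for illustration.

```python
from xml.dom import Node, minidom


def cdata_sections(parent):
    """Collect the stripped contents of every CDATA child of parent."""
    return [
        node.data.strip()
        for node in parent.childNodes
        if node.nodeType == Node.CDATA_SECTION_NODE  # same value as 4
    ]


# Illustrative element with two CDATA sections separated by plain text.
doc = minidom.parseString(
    "<Category><![CDATA[ A ]]> and <![CDATA[ B ]]></Category>"
)
sections = cdata_sections(doc.documentElement)
print(sections)
```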

answered Oct 08 '22 20:10

BBog
