Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Parsing CDATA in xml with python

I need to parse an XML file with a number of blocks of CDATA that I need to retain for later plotting:

<process id="process1"> <log name="name1" device="device1"><![CDATA[timestamp value]]]></log> <log name="name2" device="device2"><![CDATA[timestamp value, timestamp value, timestamp]]]></log> </process>

I will need to do this repeatedly and quickly, and I am looking for the best way to do this. I've read that ElementTree is the faster of the methods, but I am open to other suggestions.

like image 460
Jen Avatar asked Dec 04 '12 00:12

Jen


People also ask

How do I use CDATA in XML?

A CDATA section begins with the character sequence <! [CDATA[ and ends with the character sequence ]]>. Between the two character sequences, an XML processor ignores all markup characters such as <, >, and &. The only markup an XML pro-cessor recognizes inside a CDATA section is the closing character sequence ]>.

What is XML parsing in Python?

Python allows parsing these XML documents using two modules namely, the xml. etree. ElementTree module and Minidom (Minimal DOM Implementation). Parsing means to read information from a file and split it into pieces by identifying parts of that particular XML file.

What does <![ CDATA in XML mean?

The term CDATA means, Character Data. CDATA is defined as blocks of text that are not parsed by the parser, but are otherwise recognized as markup. The predefined entities such as &lt;, &gt;, and &amp; require typing and are generally difficult to read in the markup.


1 Answers

Here are two examples of how to do it:

from lxml import etree
import xml.etree.ElementTree as ElementTree

CONTENT = """
<process id="process1">
 <log name="name1" device="device1"><![CDATA[timestamp value]]></log>
 <log name="name2" device="device2"><![CDATA[timestamp value, timestamp value, timestamp]]></log>
</process>
"""

def parse_with_lxml():
    root = etree.fromstring(CONTENT)
    for log in root.xpath("//log"):
        print log.text

def parse_with_stdlib():
    root = ElementTree.fromstring(CONTENT)
    for log in root.iter('log'):
        print log.text

if __name__ == '__main__':
    parse_with_lxml()
    parse_with_stdlib()

Output:

timestamp value
timestamp value, timestamp value, timestamp
timestamp value
timestamp value, timestamp value, timestamp

The text attribute it handles it in both cases.

like image 130
Joe Avatar answered Oct 16 '22 10:10

Joe