I need to parse an XML file with a number of blocks of CDATA that I need to retain for later plotting:
<process id="process1">
<log name="name1" device="device1"><![CDATA[timestamp value]]></log>
<log name="name2" device="device2"><![CDATA[timestamp value, timestamp value, timestamp]]></log>
</process>
I will need to do this repeatedly and quickly, and I am looking for the best way to do this. I've read that ElementTree is the faster of the methods, but I am open to other suggestions.
A CDATA section begins with the character sequence <![CDATA[ and ends with the character sequence ]]>. Between the two sequences, an XML processor treats markup characters such as <, >, and & as literal text. The only markup an XML processor recognizes inside a CDATA section is the closing character sequence ]]>.
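As a minimal sketch of that rule, the snippet below parses a made-up element (the `SAMPLE` string is an illustrative assumption, not data from the question) whose CDATA payload contains <, >, and &. With the standard library's ElementTree, the payload should come back unchanged through the element's `text` attribute.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the CDATA payload holds characters that would
# otherwise have to be escaped as entities.
SAMPLE = '<log name="demo"><![CDATA[a < b && c > d]]></log>'

log = ET.fromstring(SAMPLE)

# The parser hands the CDATA payload over as ordinary element text.
cdata_text = log.text
print(cdata_text)
```

Note that ElementTree does not record that the text came from a CDATA section; it only exposes the resulting string.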
Python can parse such XML documents with standard-library modules such as xml.etree.ElementTree and xml.dom.minidom (Minimal DOM Implementation). Parsing means reading the file and splitting it into its parts (elements, attributes, and text).
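For comparison, here is a rough sketch of the minidom route. Unlike ElementTree, minidom keeps CDATA sections as separate child nodes, so the text has to be collected from them. The helper name `cdata_by_name` and the shortened `SAMPLE` document are assumptions made for illustration.

```python
from xml.dom import minidom

# Shortened version of the document from the question (illustrative only).
SAMPLE = (
    '<process id="process1">'
    '<log name="name1" device="device1"><![CDATA[timestamp value]]></log>'
    '<log name="name2" device="device2"><![CDATA[timestamp value, timestamp value]]></log>'
    '</process>'
)

def cdata_by_name(xml_text):
    """Return {log name: CDATA text} for every <log> element."""
    doc = minidom.parseString(xml_text)
    result = {}
    for log in doc.getElementsByTagName("log"):
        # Keep only CDATA section children and join them in document order.
        parts = [
            node.data
            for node in log.childNodes
            if node.nodeType == node.CDATA_SECTION_NODE
        ]
        result[log.getAttribute("name")] = "".join(parts)
    return result

logs = cdata_by_name(SAMPLE)
print(logs)
```

minidom builds the whole DOM in memory, so for large files read repeatedly it is probably worth timing it against ElementTree before committing to it.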
The term CDATA means character data. CDATA sections are blocks of text that the parser does not interpret as markup; their contents are passed through as plain character data. They are convenient because the predefined entities such as &lt;, &gt;, and &amp; are tedious to type and hard to read in the markup.
Here are two examples of how to do it:
from lxml import etree
import xml.etree.ElementTree as ElementTree

CONTENT = """
<process id="process1">
<log name="name1" device="device1"><![CDATA[timestamp value]]></log>
<log name="name2" device="device2"><![CDATA[timestamp value, timestamp value, timestamp]]></log>
</process>
"""

def parse_with_lxml():
    # lxml (third-party): select every <log> element with XPath
    root = etree.fromstring(CONTENT)
    for log in root.xpath("//log"):
        print(log.text)

def parse_with_stdlib():
    # Standard library: iterate over every <log> descendant
    root = ElementTree.fromstring(CONTENT)
    for log in root.iter('log'):
        print(log.text)

if __name__ == '__main__':
    parse_with_lxml()
    parse_with_stdlib()
Output:
timestamp value
timestamp value, timestamp value, timestamp
timestamp value
timestamp value, timestamp value, timestamp
In both cases, the element's text attribute returns the CDATA content.
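Since the goal is to keep the data for later plotting, one possible next step is to turn each CDATA string into a list of (timestamp, value) pairs keyed by the log name. The sketch below assumes that entries are separated by commas, that timestamp and value are separated by whitespace, and that both are numeric; the question does not confirm any of this, and the sample numbers and the helper name `collect_series` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical data: the question only shows "timestamp value" placeholders.
CONTENT = """
<process id="process1">
<log name="name1" device="device1"><![CDATA[1 10.5]]></log>
<log name="name2" device="device2"><![CDATA[1 3.0, 2 4.5, 3]]></log>
</process>
"""

def collect_series(xml_text):
    """Return {log name: [(timestamp, value), ...]} for every <log>."""
    root = ET.fromstring(xml_text)
    series = {}
    for log in root.iter("log"):
        points = []
        # Assumed format: comma-separated "timestamp value" entries.
        for entry in (log.text or "").split(","):
            fields = entry.split()
            # Skip incomplete entries such as a trailing lone timestamp.
            if len(fields) == 2:
                points.append((float(fields[0]), float(fields[1])))
        series[log.get("name")] = points
    return series

series = collect_series(CONTENT)
print(series)
```

The resulting dictionary can be passed to whatever plotting library is used; if the real timestamps are not plain numbers, the `float` conversion would need to be replaced with a suitable parser.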