Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

LXML kills my CDATA sections

I'm batch-converting a lot of XML files, changing their character encodings to UTF-8:

with open(source_filename, "rb") as source:
    tree = etree.parse(source)

    with open(destination_filename, "wb") as destination:
        tree.write(destination, encoding="UTF-8", xml_declaration=True)

Unfortunately, it is destroying my CDATA sections and just escaping them instead.

Source:

<d><![CDATA[áÌÀøÅàùÑÄéú ëÌÄé áÈàÅùÑ éäå''ä ðÄùÑÀôÌÈè <small><small>(ùí ëå èæ)</small></small>

Destination:

<d>בְּרֵאשִׁית כִּי בָאֵשׁ יהו''ה נִשְׁפָּט &lt;small&gt;&lt;small&gt;(שם כו טז)&lt;/small&gt;&lt;/small&gt;

Is there a setting which I can set which will tell it to leave my CDATA sections alone? I'm mainly using LXML to change the character encoding and to write the XML header properly.

like image 285
Naftuli Kay Avatar asked Sep 12 '14 17:09

Naftuli Kay


1 Answers

Use the strip_cdata=False option:

import lxml.etree as etree
parser = etree.XMLParser(strip_cdata=False)
with open(source_filename, "rb") as source:
    tree = etree.parse(source, parser=parser)
like image 162
unutbu Avatar answered Sep 29 '22 10:09

unutbu