I'm batch-converting a lot of XML files, changing their character encodings to UTF-8:
with open(source_filename, "rb") as source:
tree = etree.parse(source)
with open(destination_filename, "wb") as destination:
tree.write(destination, encoding="UTF-8", xml_declaration=True)
Unfortunately, it is destroying my CDATA
sections and just escaping them instead.
Source:
<d><![CDATA[áÌÀøÅàùÑÄéú ëÌÄé áÈàÅùÑ éäå''ä ðÄùÑÀôÌÈè <small><small>(ùí ëå èæ)</small></small>
Destination:
<d>בְּרֵאשִׁית כִּי בָאֵשׁ יהו''ה נִשְׁפָּט <small><small>(שם כו טז)</small></small>
Is there a setting which I can set which will tell it to leave my CDATA sections alone? I'm mainly using LXML to change the character encoding and to write the XML header properly.
Use the strip_cdata=False
option:
import lxml.etree as etree
parser = etree.XMLParser(strip_cdata=False)
with open(source_filename, "rb") as source:
tree = etree.parse(source, parser=parser)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With