I'm currently writing a script to convert a bunch of XML files from various encodings to a unified UTF-8.
I first try determining the encoding using LXML:
    def get_source_encoding(self):
        tree = etree.parse(self.inputfile)
        encoding = tree.docinfo.encoding
        self.inputfile.seek(0)
        return (encoding or '').lower()
If that's blank, I try getting it from chardet:
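    def guess_source_encoding(self):
        chunk = self.inputfile.read(1024 * 10)
        self.inputfile.seek(0)
        # chardet.detect() returns a dict; the guess is under 'encoding' and may be None
        return (chardet.detect(chunk)['encoding'] or '').lower()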
I then use codecs to convert the encoding of the file:
    def convert_encoding(self, source_encoding, input_filename, output_filename):
        chunk_size = 16 * 1024
        with codecs.open(input_filename, "rb", source_encoding) as source:
            with codecs.open(output_filename, "wb", "utf-8") as destination:
                while True:
                    chunk = source.read(chunk_size)
                    if not chunk:
                        break
                    destination.write(chunk)
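For comparison, here is a minimal sketch of the same chunked transcoding step using io.open and shutil.copyfileobj instead of codecs.open. The function name transcode_to_utf8 and its parameters are illustrative, not part of the original script. Like the version above, it only re-encodes the text and does not touch the XML declaration.

```python
import io
import shutil


def transcode_to_utf8(input_filename, output_filename, source_encoding,
                      chunk_size=16 * 1024):
    """Copy a text file, decoding from source_encoding and encoding as UTF-8."""
    # newline="" disables newline translation, so line endings are preserved as-is.
    with io.open(input_filename, "r", encoding=source_encoding, newline="") as source:
        with io.open(output_filename, "w", encoding="utf-8", newline="") as destination:
            # copyfileobj reads and writes in chunks of chunk_size characters.
            shutil.copyfileobj(source, destination, chunk_size)
```

A call such as transcode_to_utf8("in.xml", "out.xml", "windows-1255") would then stand in for the loop above; the file names here are placeholders.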
Finally, I'm attempting to rewrite the XML header. If the XML header was originally
<?xml version="1.0"?>
or
<?xml version="1.0" encoding="windows-1255"?>
I'd like to transform it to
<?xml version="1.0" encoding="UTF-8"?>
My current code doesn't seem to work:
    def edit_header(self, input_filename):
        output_filename = tempfile.mktemp(suffix=".xml")
        with open(input_filename, "rb") as source:
            parser = etree.XMLParser(encoding="UTF-8")
            tree = etree.parse(source, parser)
            with open(output_filename, "wb") as destination:
                tree.write(destination, encoding="UTF-8")
The file I'm currently testing has a header that doesn't specify the encoding. How can I make it output the header properly with the encoding specified?
Try:
    tree.write(destination, xml_declaration=True, encoding='UTF-8')
From the API docs:
xml_declaration controls if an XML declaration should be added to the file. Use False for never, True for always, None for only if not US-ASCII or UTF-8 (default is None).
Sample from ipython:
    In [15]: etree.ElementTree(etree.XML('<hi/>')).write(sys.stdout, xml_declaration=True, encoding='UTF-8')
    <?xml version='1.0' encoding='UTF-8'?>
    <hi/>
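To make the three xml_declaration settings quoted above concrete, the sketch below serializes the same one-element document with each value. The render helper is hypothetical. The fallback to the standard library's xml.etree.ElementTree, whose write() takes the same encoding and xml_declaration arguments, is an assumption added only so the snippet runs where lxml is not installed.

```python
import io

try:
    from lxml import etree
except ImportError:  # assumption: stdlib fallback with the same write() arguments
    import xml.etree.ElementTree as etree


def render(xml_declaration, encoding):
    """Serialize <hi/> to bytes with the given declaration setting (hypothetical helper)."""
    tree = etree.ElementTree(etree.fromstring("<hi/>"))
    buffer = io.BytesIO()  # write() emits encoded bytes, so use a binary buffer
    tree.write(buffer, encoding=encoding, xml_declaration=xml_declaration)
    return buffer.getvalue()


# Report whether each combination produced an XML declaration.
for setting, encoding in [(True, "UTF-8"), (None, "UTF-8"),
                          (None, "ISO-8859-1"), (False, "ISO-8859-1")]:
    has_declaration = render(setting, encoding).startswith(b"<?xml")
    print(setting, encoding, has_declaration)
```

Writing to a bytes buffer rather than sys.stdout also keeps the example usable on Python 3, where sys.stdout expects text.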
On reflection, I think you're trying way too hard. lxml automatically detects the encoding and correctly parses the file according to that encoding.
So all you really have to do (at least in Python 2.7) is:
    def convert_encoding(self, source_encoding, input_filename, output_filename):
        tree = etree.parse(input_filename)
        # Binary mode: tree.write() emits encoded bytes (required on Python 3)
        with open(output_filename, 'wb') as destination:
            tree.write(destination, encoding='utf-8', xml_declaration=True)
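A hedged end-to-end sketch of that idea, run on an in-memory document rather than a file: the parser reads the encoding from the declaration, and write() re-serializes as UTF-8 with a fresh declaration. The to_utf8 helper and the sample document are invented for illustration. The import falls back to the standard library's ElementTree (assumed to behave the same for this case) only so the snippet runs without lxml.

```python
import io

try:
    from lxml import etree
except ImportError:  # assumption: stdlib fallback for environments without lxml
    import xml.etree.ElementTree as etree


def to_utf8(xml_bytes):
    """Parse XML bytes in their declared encoding and return UTF-8 bytes (hypothetical helper)."""
    tree = etree.parse(io.BytesIO(xml_bytes))  # encoding comes from the declaration
    out = io.BytesIO()
    tree.write(out, encoding="UTF-8", xml_declaration=True)
    return out.getvalue()


# Sample input: a windows-1255 document containing Hebrew text.
source = (u'<?xml version="1.0" encoding="windows-1255"?>'
          u'<greeting>\u05e9\u05dc\u05d5\u05dd</greeting>').encode("cp1255")
result = to_utf8(source)
print(result.decode("utf-8"))
```

The output should start with a UTF-8 declaration and no longer mention windows-1255. A file with no declaration at all is parsed under the XML default rules, so a non-UTF-8 file without a declaration would still need the encoding supplied explicitly.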