
Writing an XML header with LXML

I'm currently writing a script to convert a bunch of XML files from various encodings to a unified UTF-8.

I first try determining the encoding using LXML:

def get_source_encoding(self):
    tree = etree.parse(self.inputfile)
    encoding = tree.docinfo.encoding
    self.inputfile.seek(0)
    return (encoding or '').lower()

If that's blank, I try getting it from chardet:

def guess_source_encoding(self):
    chunk = self.inputfile.read(1024 * 10)
    self.inputfile.seek(0)
    # chardet.detect() returns a dict; its 'encoding' value may be None
    return (chardet.detect(chunk)['encoding'] or '').lower()

I then use codecs to convert the encoding of the file:

def convert_encoding(self, source_encoding, input_filename, output_filename):
    chunk_size = 16 * 1024

    with codecs.open(input_filename, "rb", source_encoding) as source:
        with codecs.open(output_filename, "wb", "utf-8") as destination:
            while True:
                chunk = source.read(chunk_size)

                if not chunk:
                    break

                destination.write(chunk)
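
As a rough, self-contained sketch of what this conversion step does on its own, the snippet below runs the same chunked codecs loop as a standalone function on a throwaway file. The sample document, the temporary paths and the standalone function signature are assumptions made for illustration. Note that the bytes become UTF-8 while the declaration inside the file still names the old encoding:

```python
import codecs
import os
import tempfile


def convert_encoding(source_encoding, input_filename, output_filename,
                     chunk_size=16 * 1024):
    """Re-encode a text file to UTF-8, decoding it from source_encoding."""
    with codecs.open(input_filename, "r", source_encoding) as source:
        with codecs.open(output_filename, "w", "utf-8") as destination:
            while True:
                chunk = source.read(chunk_size)  # decoded text, not raw bytes
                if not chunk:
                    break
                destination.write(chunk)  # encoded to UTF-8 on write


# Hypothetical sample data: a Hebrew word stored as windows-1255 bytes.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "in.xml")
dst = os.path.join(workdir, "out.xml")
text = (u'<?xml version="1.0" encoding="windows-1255"?>'
        u'<a>\u05e9\u05dc\u05d5\u05dd</a>')
with open(src, "wb") as f:
    f.write(text.encode("windows-1255"))

convert_encoding("windows-1255", src, dst)

with open(dst, "rb") as f:
    converted = f.read()

# The content is now valid UTF-8, but the header is stale.
header_is_stale = b"windows-1255" in converted
```

This is why the header still needs a separate rewrite after the byte-level conversion.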

Finally, I'm attempting to rewrite the XML header. If the XML header was originally

<?xml version="1.0"?>

or

<?xml version="1.0" encoding="windows-1255"?>

I'd like to transform it to

<?xml version="1.0" encoding="UTF-8"?>

My current code doesn't seem to work:

def edit_header(self, input_filename):
    output_filename = tempfile.mktemp(suffix=".xml")

    with open(input_filename, "rb") as source:
        parser = etree.XMLParser(encoding="UTF-8")
        tree = etree.parse(source, parser)

        with open(output_filename, "wb") as destination:
            tree.write(destination, encoding="UTF-8")

The file I'm currently testing has a header that doesn't specify the encoding. How can I make it output the header properly with the encoding specified?

Naftuli Kay asked Sep 12 '14 00:09


1 Answer

Try:

tree.write(destination, xml_declaration=True, encoding='UTF-8')

From the API docs:

xml_declaration controls if an XML declaration should be added to the file. Use False for never, True for always, None for only if not US-ASCII or UTF-8 (default is None).

Sample from ipython:

In [15]:  etree.ElementTree(etree.XML('<hi/>')).write(sys.stdout, xml_declaration=True, encoding='UTF-8')
<?xml version='1.0' encoding='UTF-8'?>
<hi/>
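
To compare the settings side by side, here is a small sketch that writes the same tree into an in-memory buffer instead of sys.stdout. The serialize helper is a hypothetical name introduced only for this illustration, and ISO-8859-1 is just an arbitrary non-UTF-8 encoding picked for the third case:

```python
import io

from lxml import etree

tree = etree.ElementTree(etree.XML("<hi/>"))


def serialize(**kwargs):
    """Hypothetical helper: return what tree.write() emits, as bytes."""
    buf = io.BytesIO()
    tree.write(buf, **kwargs)
    return buf.getvalue()


# Declaration forced on, even though the encoding is UTF-8.
with_decl = serialize(encoding="UTF-8", xml_declaration=True)

# xml_declaration left at None: no declaration for UTF-8 output.
default_utf8 = serialize(encoding="UTF-8")

# xml_declaration left at None: a non-UTF-8 encoding gets one anyway.
default_latin1 = serialize(encoding="ISO-8859-1")
```

Only the first and third results should start with an XML declaration, which matches the description quoted from the API docs.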

On reflection, I think you are trying way too hard. lxml automatically detects the encoding and correctly parses the file according to that encoding.

So all you really have to do (at least in Python 2.7) is:

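def convert_encoding(self, source_encoding, input_filename, output_filename):
    # lxml reads the encoding from the file itself
    tree = etree.parse(input_filename)
    # binary mode: tree.write() emits encoded bytes (also required on Python 3)
    with open(output_filename, 'wb') as destination:
        tree.write(destination, encoding='utf-8', xml_declaration=True)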
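
A quick way to check this end to end is sketched below. The sample document, its ISO-8859-1 encoding and the temporary file names are assumptions chosen for the demonstration, not part of the original script:

```python
import os
import tempfile

from lxml import etree

workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "in.xml")
dst = os.path.join(workdir, "out.xml")

# Hypothetical input: a document that declares a non-UTF-8 encoding.
original = u'<?xml version="1.0" encoding="ISO-8859-1"?>\n<a>caf\u00e9</a>'
with open(src, "wb") as f:
    f.write(original.encode("iso-8859-1"))

# lxml picks up the declared encoding while parsing.
tree = etree.parse(src)

# Binary mode, because tree.write() produces encoded bytes.
with open(dst, "wb") as destination:
    tree.write(destination, encoding="UTF-8", xml_declaration=True)

with open(dst, "rb") as f:
    result = f.read()
```

After this runs, result should begin with a UTF-8 declaration and contain the text re-encoded as UTF-8, with no separate codecs pass or header edit.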

Robᵩ answered Oct 27 '22 11:10