Writing an XML header with LXML

Question

I'm currently writing a script to convert a bunch of XML files from various encodings to a unified UTF-8.

I first try determining the encoding using LXML:

def get_source_encoding(self):
    tree = etree.parse(self.inputfile)
    encoding = tree.docinfo.encoding
    self.inputfile.seek(0)
    return (encoding or '').lower()

If that's blank, I try getting it from chardet:

def guess_source_encoding(self):
    chunk = self.inputfile.read(1024 * 10)
    self.inputfile.seek(0)
    return chardet.detect(chunk).lower()

I then use codecs to convert the encoding of the file:

def convert_encoding(self, source_encoding, input_filename, output_filename):
    chunk_size = 16 * 1024

    with codecs.open(input_filename, "rb", source_encoding) as source:
        with codecs.open(output_filename, "wb", "utf-8") as destination:
            while True:
                chunk = source.read(chunk_size)

                if not chunk:
                    break;

                destination.write(chunk)

Finally, I'm attempting to rewrite the XML header. If the XML header was originally

<?xml version="1.0"?>

or

<?xml version="1.0" encoding="windows-1255"?>

I'd like to transform it to

<?xml version="1.0" encoding="UTF-8"?>

My current code doesn't seem to work:

def edit_header(self, input_filename):
    output_filename = tempfile.mktemp(suffix=".xml")

    with open(input_filename, "rb") as source:
        parser = etree.XMLParser(encoding="UTF-8")
        tree = etree.parse(source, parser)

        with open(output_filename, "wb") as destination:
            tree.write(destination, encoding="UTF-8")

The file I'm currently testing has a header that doesn't specify the encoding. How can I make it output the header properly with the encoding specified?

Robᵩ · Accepted Answer

Try:

tree.write(destination, xml_declaration=True, encoding='UTF-8')

From the API docs:

xml_declaration controls if an XML declaration should be added to the file. Use False for never, True for always, None for only if not US-ASCII or UTF-8 (default is None).

Sample from ipython:

In [15]:  etree.ElementTree(etree.XML('<hi/>')).write(sys.stdout, xml_declaration=True, encoding='UTF-8')
<?xml version='1.0' encoding='UTF-8'?>
<hi/>

On reflection, I think you trying way too hard. lxml automatically detects the encoding and correctly parses the file according to that encoding.

So all you really have to do (at least in Python2.7) is:

def convert_encoding(self, source_encoding, input_filename, output_filename):
    tree = etree.parse(input_filename)
    with open(output_filename, 'w') as destination:
        tree.write(destination, encoding='utf-8', xml_declaration=True)

Writing an XML header with LXML

Tags:

python

xml

lxml

elementtree

Naftuli Kay

1 Answers

Robᵩ

Recent Activity

Donate For Us

Writing an XML header with LXML

Tags:

python

xml

lxml

elementtree

Naftuli Kay

1 Answers

Robᵩ

Related questions

Recent Activity

Donate For Us