I have a number of XML files I'd like to process with a script, converting them from whatever encoding that they're in to UTF-8.
Using the code given in this great answer I can do the conversion, but how can I read the encoding given in the XML header?
For example, I have many files which are already in UTF-8, which should be left alone:
<?xml version="1.0" encoding="utf-8"?>
However, I have a lot of files which do need to be converted:
<?xml version="1.0" encoding="windows-1255"?>
How can I detect the XML encoding specified in the headers of these files in Python? Better, after I detect and reencode the files, how then can I change this XML header to read "utf-8" to avoid processing it in the future?
Use lxml
to do the parsing; you can then access the original encoding with:
from lxml import etree
with open(filename, 'r') as xmlfile:
tree = etree.parse(xmlfile)
if tree.docinfo.encoding == 'utf-8':
# already in correct encoding, abort
return
You can then use lxml
to write the file out again in UTF-8.
How can I detect the XML encoding specified in the headers of these files in Python?
A solution by Rob Wolfe using only the standard library:
from xml.parsers import expat
s = """<?xml version='1.0' encoding='iso-8859-1'?>
<book>
<title>Title</title>
<chapter>Chapter 1</chapter>
</book>"""
class MyParser(object):
def XmlDecl(self, version, encoding, standalone):
print "XmlDecl", version, encoding, standalone
def Parse(self, data):
Parser = expat.ParserCreate()
Parser.XmlDeclHandler = self.XmlDecl
Parser.Parse(data, 1)
parser = MyParser()
parser.Parse(s)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With