
Writing XML to file corrupts files in Python

I'm trying to write the contents of an xml.dom.minidom object to a file. The simplest approach is to use the writexml() method:

import codecs
from xml.dom import minidom

def write_xml_native():
    # Building DOM from XML
    xmldoc = minidom.parse('semio2.xml')
    f = codecs.open('codified.xml', mode='w', encoding='utf-8')
    # Using native writexml() method to write
    xmldoc.writexml(f, encoding="utf-8")
    f.close()

The problem is that this corrupts the non-Latin text in the file. The other way is to get the XML as a string and write it to the file explicitly:

def write_xml():
    # Building DOM from XML
    xmldoc = minidom.parse('semio2.xml')
    # Opening file for writing UTF-8, which is XML's default encoding
    f = codecs.open('codified3.xml', mode='w', encoding='utf-8')
    # Writing XML in UTF-8 encoding, as recommended in the documentation
    f.write(xmldoc.toxml("utf-8"))
    f.close()

This results in the following error:

Traceback (most recent call last):
  File "D:\Projects\Semio\semioparser.py", line 45, in <module>
    write_xml()
  File "D:\Projects\Semio\semioparser.py", line 42, in write_xml
    f.write(xmldoc.toxml(encoding="utf-8"))
  File "C:\Python26\lib\codecs.py", line 686, in write
    return self.writer.write(data)
  File "C:\Python26\lib\codecs.py", line 351, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 2064: ordinal not in range(128)

How do I write XML text to a file? What am I missing?

EDIT: The error is fixed by adding a decode call: f.write(xmldoc.toxml("utf-8").decode("utf-8")). But the Russian characters are still corrupted.

The text is not corrupted when viewed in the interpreter, only when it is written to the file.

martinthenext asked Dec 19 '10 17:12


1 Answer

Hmm, this should work:

import codecs
import xml.dom.minidom as minidom

xml = minidom.parse("test.xml")
with codecs.open("out.xml", "w", "utf-8") as out:
    xml.writexml(out)

Alternatively, you may try:

with codecs.open("test.xml", "r", "utf-8") as inp:
    xml = minidom.parseString(inp.read().encode("utf-8"))
with codecs.open("out.xml", "w", "utf-8") as out:
    xml.writexml(out)
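
As a side note, here is a small sketch of what the encoding argument of writexml() does. It assumes Python 3 (the question uses Python 2.6), and the output path is hypothetical. As far as I know, the argument only fills in the encoding attribute of the XML declaration; the bytes on disk are determined by the file object you pass in, so the two should agree:

```python
import io
import os
import tempfile
import xml.dom.minidom as minidom

# Parse from UTF-8 bytes, as in the snippets above.
doc = minidom.parseString(u"<ru>Тест</ru>".encode("utf-8"))

# Hypothetical output path inside a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "declared.xml")

# The file object performs the actual encoding; writexml's encoding
# argument only writes encoding="utf-8" into the XML declaration.
with io.open(path, "w", encoding="utf-8") as out:
    doc.writexml(out, encoding="utf-8")

# Read the raw bytes back to inspect what was written.
with io.open(path, "rb") as inp:
    raw = inp.read()
```

If the declared encoding and the file's real encoding differ (or the declaration contains a typo), a viewer may decode the file incorrectly even though the bytes themselves are fine.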

Update: If you construct the XML from a unicode string object, encode it before passing it to the minidom parser, like this:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import codecs
import xml.dom.minidom as minidom

xml = minidom.parseString(u"<ru>Тест</ru>".encode("utf-8"))
with codecs.open("out.xml", "w", "utf-8") as out:
    xml.writexml(out)
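
For readers on Python 3, a rough equivalent might look like the sketch below. It is an assumption on my part that you are on Python 3, where toxml(encoding=...) returns bytes, so the file is opened in binary mode instead of going through codecs. The helper name roundtrip and the temporary path are made up for illustration:

```python
import os
import tempfile
import xml.dom.minidom as minidom


def roundtrip(text, path):
    """Hypothetical helper: parse text, write it as UTF-8, read it back."""
    # Encode before parsing, as recommended above.
    doc = minidom.parseString(text.encode("utf-8"))
    # toxml(encoding=...) yields bytes in Python 3, so write in binary mode.
    with open(path, "wb") as out:
        out.write(doc.toxml(encoding="utf-8"))
    # Read the bytes back and decode them to verify nothing was lost.
    with open(path, "rb") as inp:
        return inp.read().decode("utf-8")


# Hypothetical output location in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "out.xml")
result = roundtrip(u"<ru>Тест</ru>", path)
```

Writing bytes directly avoids mixing an already encoded string with an encoding writer, which is what triggered the UnicodeDecodeError in the question.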

barti_ddu answered Oct 15 '22 19:10