Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

python: escaping non-ascii characters in XML

Tags:

python

xml

I got my test XML file to print using the following source file, but it doesn't handle non-ASCII characters appropriately:

xmltest.py:

import xml.sax.xmlreader
import xml.sax.saxutils

def testJunk(file, e2content):
  attr0 = xml.sax.xmlreader.AttributesImpl({})
  x =  xml.sax.saxutils.XMLGenerator(file)
  x.startDocument()
  x.startElement("document", attr0)

  x.startElement("element1", attr0)
  x.characters("bingo")
  x.endElement("element1")

  x.startElement("element2", attr0)
  x.characters(e2content)
  x.endElement("element2")

  x.endElement("document")
  x.endDocument()

If I do

>>> import xmltest
>>> xmltest.testJunk(open("test.xml","w"), "ascii 001: \001")

then I get an xml file with character code 001 in it. I can't figure out how to escape this character. Firefox tells me it's not well formed XML and complains about that character. How can I fix this?

clarification: I'm trying to log the output of a function I do not have control over, which outputs non-ASCII characters.


update: OK, so now I know characters outside one of the accepted ranges can't be encoded in the form . (Or rather, they can be encoded, but that doesn't help any w/r/t XML not being well-formed.) But they can be escaped if I define a way of doing so.

(for future reference: W3C has a useful page outside the XML standard itself which says "Control codes should be replaced with appropriate markup" but doesn't really suggest any examples for doing so.)

If I wanted to escape characters outside the accepted range in the following way:

before escaping: ( represents one character, not the literal 8-character string)

 abcdefghijkl

after escaping:

 abcd<u>0001</u>efgh<u>0002</u>ijkl

How could I do this in python?

def escapeXML(src)
    dest = ??????
    return dest
like image 512
Jason S Avatar asked Dec 22 '10 21:12

Jason S


2 Answers

"\001" aka "\x01" is an ASCII control code. It is not however one of the permitted XML characters. The only ASCII control codes which qualify are "\t", "\n" and "\r".

Examples:

>>> import xml.etree.cElementTree as ET
# Raw newline works
>>> t = ET.fromstring("<e>\n</e>")
>>> t.text
'\n'

# Hex escaping of a newline works
>>> t = ET.fromstring("<e>&#xa;</e>")
>>> t.text
'\n'

# Hex escaping of "\x01" doesn't work; it's not a valid XML character
>>> t = ET.fromstring("<e>&#x1;</e>")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<string>", line 106, in XML
cElementTree.ParseError: reference to invalid character number: line 1, column 3

If you want to include invalid XML characters somehow in an XML document, they must be hidden from the XML parser by an extra level of escaping. The mechanism needs to be documented, published, and understood by readers of your document.

For example, in Microsoft Excel 2007+ XLSX files, Unicode code points which are not valid XML characters are smuggled past the parser by representing them as _xhhhh_ where hhhh is the hex representation of the codepoint. In your example this would be the 7 bytes _x0001_. Note that it is necessary to escape any _ characters in the text that would otherwise be falsely interpreted as introducing an _xhhhh_ sequence.

This is ugly, painful, inefficient, etc. You may wish to consider other methods. Is use of XML necessary? Would a CSV file (shock, horror!) do a better job in your application?

Edit Some notes on the OP's encoding proposal:

A. Although \r is a valid XML 1.0 input character, it is subject to mandatory immediate transmogrification, so you should escape it as well.

B. This scheme assumes/hopes that the <u>hhhh</u> cannot be confused with any other markup.

C. I take back what I said above about the Microsoft escaping scheme. It is relatively beautiful, painfree, and efficient. To complete the picture of your scheme for your gentle readers, you should show the code that is required to unescape the nasty bits and glue the pieces back together. Bear in mind that the MS scheme requires somebody to write one escaping function and one unescaping function, whereas your scheme requires different treatment for each tool (SAX, DOM, ElementTree).

D. At the detailed level, the code is a little bit whiffy:

if (len(g1) > 0): should be if g1:

if (not foo == None): has a record THREE deviations from the commonly accepted idiom: (1) the parentheses (2) not x == y instead of x != y (3) != None instead of is not None

Don't use list (and names of other built-in objects) as a name for your own variable.

Edit 2 You want to split up a string using a regex. Why not use re.split?

splitInvalidXML2 = re.compile(
    ur'([^\x09\x0A\x0D\x20-\x7E\x85\xA0-\xFF\u0100-\uD7FF\uE000-\uFDCF\uFDE0-\uFFFD])'
    ).split

def submitCharacters2(x, string):
    badchar = True
    for fragment in splitInvalidXML2(string):
        badchar = not badchar
        if badchar:
            x.startElement("u", attr0)
            x.characters('%04X' % ord(fragment))
            x.endElement("u")
        elif fragment:
            x.characters(fragment)
like image 95
John Machin Avatar answered Sep 26 '22 03:09

John Machin


This seems to work for me.

r = re.compile(ur'[^\x09\x0A\x0D\x20-\x7E\x85\xA0-\xFF' \
  + ur'\u0100-\uD7FF\uE000-\uFDCF\uFDE0-\uFFFD]')
def escapeInvalidXML(string):
  def replacer(m):
    return "<u>"+('%04X' % ord(m.group(0)))+"</u>"
  return re.sub(r,replacer,string)

example:

>>> s='this is a \x01 test \x0B of something'
>>> escapeInvalidXML(s)
'this is a <u>0001</u> test <u>000B</u> of something'
>>> s2 = u'this is a \x01 test \x0B of \uFDD0'
>>> escapeInvalidXML(s2)
u'this is a <u>0001</u> test <u>000B</u> of <u>FDD0</u>'

Character ranges from http://www.w3.org/TR/2006/REC-xml-20060816/#charsets, and I haven't escaped everything, just the ones below \uFFFF.


Update: Oops, forgot to adapt to the startElement/characters methods of SAX, & deal properly with multiple lines:

import re
import xml.sax.xmlreader
import xml.sax.saxutils

r = re.compile(ur'(.*?)(?:([^\x09\x0A\x0D\x20-\x7E\x85\xA0-\xFF' \
    + ur'\u0100-\uD7FF\uE000-\uFDCF\uFDE0-\uFFFD])|([\n])|$)')
attr0 = xml.sax.xmlreader.AttributesImpl({})
def splitInvalidXML(string):
    list = []
    def replacer(m):
        g1 = m.group(1)
        if (len(g1) > 0):
            list.append(g1)
        g2 = m.group(2)
        if (not g2 == None):
            list.append(ord(g2))
        g3 = m.group(3)
        if (not g3 == None):
            list.append(g3)
        return ""
    re.sub(r,replacer,string)
    return list

def submitCharacters(x, string):
    for fragment in splitInvalidXML(string):
        if (isinstance(fragment,int)):
            x.startElement("u", attr0)
            x.characters('%04X' % fragment)
            x.endElement("u")
        else:
            x.characters(fragment)

def test1(fname):
    with open(fname,'w') as f:
        x = xml.sax.saxutils.XMLGenerator(f)
        x.startDocument()
        x.startElement('document',attr0)
        submitCharacters(x, 'this is a \x01 test\nof the \x02\x0b xml system.')
        x.endElement('document')
        x.endDocument()

test1('test.xml')

This produces:

<?xml version="1.0" encoding="iso-8859-1"?>
<document>this is a <u>0001</u> test
of the <u>0002</u><u>000B</u> xml system.</document>
like image 27
Jason S Avatar answered Sep 25 '22 03:09

Jason S