Central way to filter invalid unicode chars in lxml?

Question

It is common knowledge that certain character ranges aren't allowed in XML documents. I'm aware of solutions to filter those characters out (like [1], [2]).

Following the Don't Repeat Yourself principle, I would prefer to implement one of these solutions in one central place. Right now, I have to sanitize any potentially unsafe text before it is fed to lxml. Is there a way to achieve this, e.g. by subclassing an lxml filter class, catching some exceptions, or setting a configuration switch?

Edit: To hopefully clarify this question a bit, here is some sample code:

from lxml import etree

root = etree.Element("root")
root.text = u'\uffff'
root.text += u'\ud800' 

print(etree.tostring(root))

root.text += u'\u0002'

Executing this gives the result

<root>&#65535;&#55296;</root>

Traceback (most recent call last):
  File "[…]", line 9, in <module>
    root.text += u'\u0002'
  File "lxml.etree.pyx", line 953, in lxml.etree._Element.text.__set__ (src/lxml/lxml.etree.c:44956)
  File "apihelpers.pxi", line 677, in lxml.etree._setNodeText (src/lxml/lxml.etree.c:20273)
  File "apihelpers.pxi", line 1395, in lxml.etree._utf8 (src/lxml/lxml.etree.c:26485)
ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters

As you can see, an exception is thrown for the \x02 byte, but lxml happily escapes the other two out-of-range characters. The real trouble is that

s = "<root>&#65535;&#55296;</root>"
root = etree.fromstring(s)

also throws an exception. This behavior is a bit unnerving in my opinion, especially because it produces invalid XML documents.

It turns out that this could be a Python 2 vs. 3 problem. With Python 3.4, the code above throws this exception:

Traceback (most recent call last):
  File "[…]", line 5, in <module>
    root.text += u'\ud800'
  File "lxml.etree.pyx", line 953, in lxml.etree._Element.text.__set__ (src/lxml/lxml.etree.c:44971)
  File "apihelpers.pxi", line 677, in lxml.etree._setNodeText (src/lxml/lxml.etree.c:20273)
  File "apihelpers.pxi", line 1387, in lxml.etree._utf8 (src/lxml/lxml.etree.c:26380)
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in position 1: surrogates not allowed

The only remaining problem is the \uffff character, which lxml still happily accepts.
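
To illustrate why none of the three characters belongs in an XML document, here is a small check against the Char production of the XML 1.0 specification. It uses only the standard library, assumes Python 3, and the function name is_xml_char is made up for this example:

```python
# Hypothetical helper (not part of lxml): tests one code point against the
# XML 1.0 "Char" production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
def is_xml_char(ch):
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


# The three characters from the sample code, plus one ordinary letter.
for ch in (u'\uffff', u'\ud800', u'\u0002', u'a'):
    print("U+%04X allowed: %s" % (ord(ch), is_xml_char(ch)))
```

Only the letter passes the check: U+FFFF, the lone surrogate U+D800 and the control character U+0002 all fall outside the allowed ranges.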

Lillian Seabreeze · Accepted Answer

Just filter the string before you pass it to lxml, for example with the gist "cleaning invalid characters from XML" by lawlesst.

I tried it with your code and it seems to work, except that you need to add the missing imports of re and sys to the gist.

from lxml import etree
from cleaner import invalid_xml_remove

root = etree.Element("root")
root.text = u'\uffff'
root.text += u'\ud800' 

print(etree.tostring(root))

root.text += invalid_xml_remove(u'\u0002')
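
If you want a single place for the filtering instead of calling invalid_xml_remove at every assignment, one option is a small helper that every text assignment goes through. The following is only a sketch under a few assumptions: it targets Python 3, the names strip_invalid_xml_chars and set_text are invented for this example, and the regular expression is written from the XML 1.0 Char production rather than copied from the gist. It falls back to the standard library's ElementTree when lxml is not installed, since both expose the same .text attribute.

```python
import re

try:
    from lxml import etree
except ImportError:  # fallback so the sketch runs without lxml
    import xml.etree.ElementTree as etree

# Matches every character that is NOT in the XML 1.0 "Char" production.
# Sketch only; this is not the pattern used in the linked gist.
_INVALID_XML_CHARS = re.compile(
    u'[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]'
)


def strip_invalid_xml_chars(text, replacement=u''):
    """Return text with characters not allowed in XML 1.0 replaced."""
    return _INVALID_XML_CHARS.sub(replacement, text)


def set_text(element, text):
    """Hypothetical central entry point: sanitize, then assign element.text."""
    element.text = strip_invalid_xml_chars(text)
    return element


root = etree.Element("root")
set_text(root, u'a\uffff\ud800\u0002b')
print(etree.tostring(root))
```

This only protects code that actually calls set_text; a direct assignment to root.text still bypasses the filter, so it is a convention rather than a switch inside lxml.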

Tags: python, xml, unicode, invalid-characters, lxml

Percival Ulysses
