It is common knowledge that certain character ranges aren't allowed in XML documents. I'm aware of solutions to filter those characters out (like [1], [2]).
Going with the Don't Repeat Yourself principle, I would prefer to implement one of these solutions in one central point. Right now, I have to sanitize any potentially unsafe text before it is fed to lxml. Is there a way to achieve this, e.g. by subclassing an lxml filter class, catching some exceptions, or setting a configuration switch?
Edit: To hopefully clarify this question a bit, here is some sample code:
from lxml import etree
root = etree.Element("root")
root.text = u'\uffff'
root.text += u'\ud800'
print(etree.tostring(root))
root.text += '\x02'.decode("utf-8")
Executing this gives the result:
<root>&#65535;&#55296;</root>
Traceback (most recent call last):
File "[…]", line 9, in <module>
root.text += u'\u0002'
File "lxml.etree.pyx", line 953, in lxml.etree._Element.text.__set__ (src/lxml/lxml.etree.c:44956)
File "apihelpers.pxi", line 677, in lxml.etree._setNodeText (src/lxml/lxml.etree.c:20273)
File "apihelpers.pxi", line 1395, in lxml.etree._utf8 (src/lxml/lxml.etree.c:26485)
ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters
As you can see, an exception is thrown for the \x02 byte, but lxml happily escapes the other two out-of-range characters. The real trouble is that
s = "<root>&#65535;&#55296;</root>"
root = etree.fromstring(s)
also throws an exception. This behavior is a bit unnerving in my opinion, especially because it means that lxml produces invalid XML documents.
It turns out that this could be a Python 2 vs. Python 3 problem. With Python 3.4, the code above throws this exception:
Traceback (most recent call last):
File "[…]", line 5, in <module>
root.text += u'\ud800'
File "lxml.etree.pyx", line 953, in lxml.etree._Element.text.__set__ (src/lxml/lxml.etree.c:44971)
File "apihelpers.pxi", line 677, in lxml.etree._setNodeText (src/lxml/lxml.etree.c:20273)
File "apihelpers.pxi", line 1387, in lxml.etree._utf8 (src/lxml/lxml.etree.c:26380)
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in position 1: surrogates not allowed
The only remaining problem is the \uffff character, which lxml still happily accepts.
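For reference, the XML 1.0 specification's Char production allows only #x9, #xA, #xD and the ranges #x20-#xD7FF, #xE000-#xFFFD and #x10000-#x10FFFF. A minimal Python 3 sketch of that check (the helper name `is_xml_char` is made up for illustration and is not part of lxml) shows where the three characters from the sample stand:

```python
def is_xml_char(ch):
    """Return True if the single character ch matches the XML 1.0 Char production."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


# The three characters used in the sample code above.
for ch in ('\uffff', '\ud800', '\x02'):
    print(hex(ord(ch)), is_xml_char(ch))
```

All three fall outside the allowed ranges, so a strictly conforming serializer should reject \uffff just like the other two.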
Just filter the string before you pass it to lxml: cleaning invalid characters from XML (gist by lawlesst).
I tried it with your code; it seems to work, except that you need to change the gist to import re and sys.
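from lxml import etree
# cleaner.py is the gist saved locally, with "import re" and "import sys" added
from cleaner import invalid_xml_remove
root = etree.Element("root")
root.text = u'\uffff'
root.text += u'\ud800'
print(etree.tostring(root))
# Python 2: strip the invalid control character before it reaches lxml
root.text += invalid_xml_remove('\x02'.decode("utf-8"))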
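To keep the filtering in one central place, one option is a small wrapper that every text assignment goes through. What follows is a sketch under assumptions, not an lxml feature: `strip_invalid_xml_chars` and `set_text` are hypothetical names, the pattern follows the XML 1.0 Char production rather than the gist's exact ranges, and the standard library's `xml.etree.ElementTree` stands in for `lxml.etree` so that the snippet runs on Python 3 without lxml installed (both expose the same `.text` attribute).

```python
import re
import xml.etree.ElementTree as ET  # stand-in for lxml.etree in this sketch

# Matches anything outside the XML 1.0 Char production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_INVALID_XML_CHARS = re.compile(
    r'[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]'
)


def strip_invalid_xml_chars(text):
    """Remove every character that XML 1.0 does not allow."""
    return _INVALID_XML_CHARS.sub('', text)


def set_text(element, text):
    """Hypothetical central helper: sanitize first, then assign element.text."""
    element.text = strip_invalid_xml_chars(text)


root = ET.Element("root")
# All three problem characters from the question are dropped before assignment.
set_text(root, 'ok\uffff\ud800\x02')
print(ET.tostring(root))
```

Routing assignments through one helper like this avoids repeating the filter call at every call site; whether silently dropping characters is acceptable depends on the data being handled.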