Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fast way to filter illegal xml unicode chars in python?

The XML specification lists a bunch of Unicode characters that are either illegal or "discouraged". Given a string, how can I remove all illegal characters from it?

I came up with the following regular expression, but it's a bit of a mouthful.

illegal_xml_re = re.compile(u'[\x00-\x08\x0b-\x1f\x7f-\x84\x86-\x9f\ud800-\udfff\ufdd0-\ufddf\ufffe-\uffff]') clean = illegal_xml_re.sub('', dirty) 

(Python 2.5 doesn't know about Unicode chars above 0xFFFF, so no need to filter those.)

like image 340
itsadok Avatar asked Nov 10 '09 13:11

itsadok


People also ask

How do I find illegal characters in XML?

If you're unable to identify this character visually, then you can use a text editor such as TextPad to view your source file. Within the application, use the Find function and select "hex" and search for the character mentioned. Removing these characters from your source file resolve the invalid XML character issue.

Are XML characters illegal?

The only illegal characters are & , < and > (as well as " or ' in attributes, depending on which character is used to delimit the attribute value: attr="must use &quot; here, ' is allowed" and attr='must use &apos; here, " is allowed' ). They're escaped using XML entities, in this case you want &amp; for & .


1 Answers

Recently we (Trac XmlRpcPlugin maintainers) have been notified of the fact that the regular expression above strips surrogate pairs on Python narrow builds (see th:comment:13:ticket:11050) . An alternative approach consists in using the following regex (see th:changeset:13729) .

_illegal_unichrs = [(0x00, 0x08), (0x0B, 0x0C), (0x0E, 0x1F),                          (0x7F, 0x84), (0x86, 0x9F),                          (0xFDD0, 0xFDDF), (0xFFFE, 0xFFFF)]  if sys.maxunicode >= 0x10000:  # not narrow build          _illegal_unichrs.extend([(0x1FFFE, 0x1FFFF), (0x2FFFE, 0x2FFFF),                                   (0x3FFFE, 0x3FFFF), (0x4FFFE, 0x4FFFF),                                   (0x5FFFE, 0x5FFFF), (0x6FFFE, 0x6FFFF),                                   (0x7FFFE, 0x7FFFF), (0x8FFFE, 0x8FFFF),                                   (0x9FFFE, 0x9FFFF), (0xAFFFE, 0xAFFFF),                                   (0xBFFFE, 0xBFFFF), (0xCFFFE, 0xCFFFF),                                   (0xDFFFE, 0xDFFFF), (0xEFFFE, 0xEFFFF),                                   (0xFFFFE, 0xFFFFF), (0x10FFFE, 0x10FFFF)])   _illegal_ranges = ["%s-%s" % (unichr(low), unichr(high))                     for (low, high) in _illegal_unichrs]  _illegal_xml_chars_RE = re.compile(u'[%s]' % u''.join(_illegal_ranges))  

p.s. See this post on surrogates explaining what they are for .

Update so as to not to match (replace) 0x0D which is a valid XML character.

like image 93
Olemis Lang Avatar answered Sep 20 '22 00:09

Olemis Lang