Fast way to filter illegal xml unicode chars in python?

The XML specification lists a bunch of Unicode characters that are either illegal or "discouraged". Given a string, how can I remove all illegal characters from it?

I came up with the following regular expression, but it's a bit of a mouthful.

import re

illegal_xml_re = re.compile(u'[\x00-\x08\x0b-\x1f\x7f-\x84\x86-\x9f\ud800-\udfff\ufdd0-\ufddf\ufffe-\uffff]')
clean = illegal_xml_re.sub('', dirty)

(Python 2.5 doesn't know about Unicode chars above 0xFFFF, so no need to filter those.)

asked Nov 10 '09 13:11

itsadok

1 Answer

Recently we (the Trac XmlRpcPlugin maintainers) were notified that the regular expression above strips surrogate pairs on narrow Python builds (see th:comment:13:ticket:11050). An alternative approach is to use the following regex (see th:changeset:13729).

import re
import sys

# Inclusive (low, high) code point ranges to strip; 0x09, 0x0A, 0x0D and 0x85 are kept.
_illegal_unichrs = [(0x00, 0x08), (0x0B, 0x0C), (0x0E, 0x1F),
                    (0x7F, 0x84), (0x86, 0x9F),
                    (0xFDD0, 0xFDDF), (0xFFFE, 0xFFFF)]

if sys.maxunicode >= 0x10000:  # not narrow build
    # The last two code points of every supplementary plane.
    _illegal_unichrs.extend([(0x1FFFE, 0x1FFFF), (0x2FFFE, 0x2FFFF),
                             (0x3FFFE, 0x3FFFF), (0x4FFFE, 0x4FFFF),
                             (0x5FFFE, 0x5FFFF), (0x6FFFE, 0x6FFFF),
                             (0x7FFFE, 0x7FFFF), (0x8FFFE, 0x8FFFF),
                             (0x9FFFE, 0x9FFFF), (0xAFFFE, 0xAFFFF),
                             (0xBFFFE, 0xBFFFF), (0xCFFFE, 0xCFFFF),
                             (0xDFFFE, 0xDFFFF), (0xEFFFE, 0xEFFFF),
                             (0xFFFFE, 0xFFFFF), (0x10FFFE, 0x10FFFF)])

# Build one character class such as u'[\x00-\x08...]' from the ranges (unichr is Python 2).
_illegal_ranges = ["%s-%s" % (unichr(low), unichr(high))
                   for (low, high) in _illegal_unichrs]
_illegal_xml_chars_RE = re.compile(u'[%s]' % u''.join(_illegal_ranges))
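For readers on Python 3, where `unichr` no longer exists and (since 3.3) there are no narrow builds, the same idea might be sketched as below. This is an adaptation under those assumptions, not the code from the changeset; the names `ILLEGAL_RANGES`, `ILLEGAL_XML_CHARS_RE` and `strip_illegal_xml_chars` are made up for illustration.

```python
import re

# Same ranges as the table above: control characters other than
# tab, LF, CR and NEL, plus the BMP ranges listed in the answer.
ILLEGAL_RANGES = [(0x00, 0x08), (0x0B, 0x0C), (0x0E, 0x1F),
                  (0x7F, 0x84), (0x86, 0x9F),
                  (0xFDD0, 0xFDDF), (0xFFFE, 0xFFFF)]

# The last two code points of each supplementary plane (1..16),
# generated instead of written out by hand.
ILLEGAL_RANGES += [(plane << 16 | 0xFFFE, plane << 16 | 0xFFFF)
                   for plane in range(1, 0x11)]

ILLEGAL_XML_CHARS_RE = re.compile(
    "[%s]" % "".join("%s-%s" % (chr(lo), chr(hi)) for lo, hi in ILLEGAL_RANGES))


def strip_illegal_xml_chars(text):
    """Return text with every character in ILLEGAL_RANGES removed."""
    return ILLEGAL_XML_CHARS_RE.sub("", text)
```

Like the table it is derived from, this sketch leaves the surrogate range 0xD800-0xDFFF alone, so characters above 0xFFFF pass through intact.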

P.S. See this post on surrogates for an explanation of what they are for.
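To make the surrogate issue concrete: on a narrow build, a character above 0xFFFF is stored as two UTF-16 code units, both in the range 0xD800-0xDFFF, which is exactly the range the question's regex deletes. The sketch below reproduces that representation on Python 3 by encoding to UTF-16 explicitly; the helper name `utf16_code_units` is hypothetical.

```python
import struct


def utf16_code_units(text):
    """Return the UTF-16 code units of text as a list of ints."""
    data = text.encode("utf-16-be")
    return list(struct.unpack(">%dH" % (len(data) // 2), data))


units = utf16_code_units("\U0001D11E")  # MUSICAL SYMBOL G CLEF
print([hex(u) for u in units])  # ['0xd834', '0xdd1e']
```

Both units fall inside 0xD800-0xDFFF, so a character class containing that range removes the whole character on a narrow build.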

Update: the regex was changed so that it no longer matches (and replaces) 0x0D, which is a valid XML character.
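For reference, the XML 1.0 `Char` production allows only tab (0x09), line feed (0x0A), carriage return (0x0D), and the ranges 0x20-0xD7FF, 0xE000-0xFFFD and 0x10000-0x10FFFF. A minimal sketch of that check, handy for spot-testing a regex like the one above (Python 3 syntax; the function name `is_xml_char` is just illustrative):

```python
def is_xml_char(cp):
    """True if code point cp is allowed by the XML 1.0 Char production."""
    return (cp in (0x09, 0x0A, 0x0D)
            or 0x20 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD
            or 0x10000 <= cp <= 0x10FFFF)


print(is_xml_char(0x0D), is_xml_char(0x0B))  # True False
```

Note that this only covers what is strictly illegal. Characters the spec merely discourages, such as 0x7F-0x84, still pass this check.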

answered Sep 20 '22 00:09

Olemis Lang

Tags:

python

xml

unicode
