Why would a python regex compile on Linux but not Windows?

Tags:

python

regex

I have a regex to detect invalid XML 1.0 characters in a Unicode string:

import re
bad_xml_chars = re.compile(u'[^\x09\x0A\x0D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]', re.U)

On Linux with Python 2.7, this works perfectly. On Windows, the following error is raised:

  File "C:\Python27\lib\re.py", line 190, in compile
    return _compile(pattern, flags)
  File "C:\Python27\lib\re.py", line 242, in _compile
    raise error, v # invalid expression
  sre_constants.error: bad character range

Any ideas why this isn't compiling on Windows?


asked Dec 13 '12 17:12

UsAaR33

1 Answer

You have a narrow Python build on Windows, which stores Unicode strings internally as UTF-16 code units. This means that any code point above U+FFFF is represented as a surrogate pair, that is, two separate characters in the Python string. You should see something like this:

>>> len(u'\U00010000')
2
>>> u'\U00010000'[0]
u'\ud800'
>>> u'\U00010000'[1]
u'\udc00'
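To make the split concrete, here is a small sketch of the standard UTF-16 surrogate arithmetic, together with a check for a narrow build via `sys.maxunicode`. The names `to_surrogate_pair` and `is_narrow_build` are made up for this illustration and are not part of any library.

```python
import sys

def to_surrogate_pair(code_point):
    """Return the (high, low) UTF-16 surrogates for a code point above U+FFFF.

    Hypothetical helper for illustration only.
    """
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low

# A narrow build reports 0xFFFF here; a wide build reports 0x10FFFF.
is_narrow_build = sys.maxunicode == 0xFFFF

low_pair = to_surrogate_pair(0x10000)     # first code point of the astral range
high_pair = to_surrogate_pair(0x10FFFF)   # last code point of the astral range
```

The two pairs come out as `\ud800\udc00` and `\udbff\udfff`, which are the four code units the regex engine ends up seeing in place of `\U00010000-\U0010FFFF`.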

Here is how the regex engine will attempt to interpret your string on narrow builds:

[^\x09\x0A\x0D\u0020-\ud7ff\ue000-\ufffd\ud800\udc00-\udbff\udfff]

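You can see here that \udc00-\udbff is where the invalid range message is coming from: the range starts at \udc00 but ends at the lower value \udbff, so the regex engine rejects it with "bad character range".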
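One possible workaround, sketched here rather than taken from the original answer, is to pick the pattern based on `sys.maxunicode`. On a wide build the original range works as written. On a narrow build the idea is to treat well-formed surrogate pairs as valid and flag only lone surrogates. The name `_BMP_VALID` is hypothetical, and the narrow-build branch is an assumption that I have not run on an actual narrow build.

```python
import re
import sys

# Characters XML 1.0 allows inside the Basic Multilingual Plane.
_BMP_VALID = u'\x09\x0A\x0D\u0020-\uD7FF\uE000-\uFFFD'

if sys.maxunicode > 0xFFFF:
    # Wide build: code points above U+FFFF are single characters,
    # so the original range is valid as written.
    bad_xml_chars = re.compile(u'[^%s\U00010000-\U0010FFFF]' % _BMP_VALID)
else:
    # Narrow build (assumption, untested): astral characters appear as
    # surrogate pairs, so accept proper pairs and flag lone surrogates.
    bad_xml_chars = re.compile(
        u'[^%s\uD800-\uDFFF]'                    # invalid non-surrogate character
        u'|[\uD800-\uDBFF](?![\uDC00-\uDFFF])'   # high surrogate with no low one after it
        u'|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]'  # low surrogate with no high one before it
        % _BMP_VALID)

# Strip the invalid characters from a sample string.
cleaned = bad_xml_chars.sub(u'', u'a\x0bb\uFFFEc')
```

Here `\x0b` (vertical tab) and `\uFFFE` both fall outside the XML 1.0 character ranges, so both should be removed from the sample string.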

answered Oct 05 '22 13:10

Andrew Clark
