
Why would a Python regex compile on Linux but not Windows?

Tags:

python

regex

I have a regex to detect invalid XML 1.0 characters in a Unicode string:

bad_xml_chars = re.compile(u'[^\x09\x0A\x0D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]', re.U)

On Linux with Python 2.7, this works perfectly. On Windows, the following error is raised:

  File "C:\Python27\lib\re.py", line 190, in compile
    return _compile(pattern, flags)
  File "C:\Python27\lib\re.py", line 242, in _compile
    raise error, v # invalid expression
  sre_constants.error: bad character range

Any ideas why this isn't compiling on Windows?

UsAaR33 asked Dec 13 '12 17:12


1 Answer

You have a narrow Python build on Windows, so Unicode strings are stored internally as UTF-16. This means that any code point above U+FFFF is represented as a surrogate pair, which shows up as two separate characters in the Python string. You should see something like this:

>>> len(u'\U00010000')
2
>>> u'\U00010000'[0]
u'\ud800'
>>> u'\U00010000'[1]
u'\udc00'
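The two halves come from the standard UTF-16 surrogate arithmetic. The following is a minimal sketch of that calculation, not part of the original answer; the function name to_surrogate_pair is hypothetical. It also shows that the upper bound \U0010FFFF splits into \udbff and \udfff in the same way.

```python
def to_surrogate_pair(code_point):
    """Return the (high, low) UTF-16 code units for a code point above U+FFFF."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not in a supplementary plane")
    offset = code_point - 0x10000        # 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


print([hex(u) for u in to_surrogate_pair(0x10000)])   # ['0xd800', '0xdc00']
print([hex(u) for u in to_surrogate_pair(0x10FFFF)])  # ['0xdbff', '0xdfff']
```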

Here is how the regex engine will attempt to interpret your string on narrow builds:

[^\x09\x0A\x0D\u0020-\ud7ff\ue000-\ufffd\ud800\udc00-\udbff\udfff]

You can see here that \udc00-\udbff is where the "bad character range" error comes from: the range starts at \udc00, which is higher than its end point, \udbff.
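One possible workaround, sketched here under assumptions rather than taken from the original answer, is to pick the pattern based on sys.maxunicode. On a narrow build, the sketch accepts well-formed surrogate pairs and flags only lone surrogates. The helper name compile_bad_xml_chars and its narrow parameter are hypothetical, and the narrow branch has not been verified on an actual narrow build.

```python
import re
import sys

# Characters XML 1.0 allows inside the Basic Multilingual Plane.
_BMP_VALID = u'\x09\x0A\x0D\u0020-\uD7FF\uE000-\uFFFD'


def compile_bad_xml_chars(narrow=None):
    """Compile a regex matching characters that are invalid in XML 1.0."""
    if narrow is None:
        # Narrow builds report 0xFFFF; wide builds and Python 3.3+ report 0x10FFFF.
        narrow = sys.maxunicode <= 0xFFFF
    if narrow:
        pattern = (
            # High surrogate not followed by a low surrogate.
            u'[\uD800-\uDBFF](?![\uDC00-\uDFFF])'
            # Low surrogate not preceded by a high surrogate.
            u'|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]'
            # Anything that is neither a valid BMP character nor a surrogate.
            u'|[^' + _BMP_VALID + u'\uD800-\uDFFF]'
        )
    else:
        # Supplementary code points are single characters, so the
        # original range from the question is well-formed here.
        pattern = u'[^' + _BMP_VALID + u'\U00010000-\U0010FFFF]'
    return re.compile(pattern, re.U)


bad_xml_chars = compile_bad_xml_chars()
cleaned = bad_xml_chars.sub(u'', u'a\x00b\x0bc')  # strips NUL and vertical tab
```

The narrow pattern should only be used on a narrow build: on a wide build it would treat every character above U+FFFF as invalid, which is why the choice is made from sys.maxunicode by default.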

Andrew Clark answered Oct 05 '22 13:10