
How can I represent this regex to not get a "bad character range" error?

Tags: python, regex

Is there a better way to do this?

$ python
Python 2.7.9 (default, Jul 16 2015, 14:54:10)
[GCC 4.1.2 20080704 (Red Hat 4.1.2-55)] on linux2

Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> re.sub(u'[\U0001d300-\U0001d356]', "", "")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/fast/services/lib/python2.7/re.py", line 155, in sub
    return _compile(pattern, flags).sub(repl, string, count)
  File "/home/fast/services/lib/python2.7/re.py", line 251, in _compile
    raise error, v # invalid expression
sre_constants.error: bad character range
chaimp asked Jul 24 '15 05:07


1 Answer

Python narrow and wide build (Python versions below 3.3)

The error suggests that you are using a "narrow" (UCS-2) build, which only supports Unicode code points up to 65535 (U+FFFF) as one "Unicode character" [1]. Characters whose code points are above 65535 are represented as surrogate pairs, which means that the Unicode string u'\U0001d300' consists of two "Unicode characters" in a narrow build.

Python 2.7.8 (default, Jul 25 2014, 14:04:36)
[GCC 4.8.3] on cygwin
>>> import sys; sys.maxunicode
65535
>>> len(u'\U0001d300')
2
>>> [hex(ord(i)) for i in u'\U0001d300']
['0xd834', '0xdf00']
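
The two values above follow from the standard UTF-16 arithmetic. As a minimal sketch (the helper name to_surrogate_pair is mine, not part of any library), the split can be computed like this:

```python
def to_surrogate_pair(code_point):
    """Split an astral code point (U+10000..U+10FFFF) into UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not an astral code point: %#x" % code_point)
    offset = code_point - 0x10000        # 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits select the high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits select the low surrogate
    return high, low

# Hypothetical usage: the first character of the range in the question.
print([hex(unit) for unit in to_surrogate_pair(0x1D300)])
```

For U+1D300 this gives the same pair as the narrow-build session above, 0xd834 and 0xdf00.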

In a "wide" (UCS-4) build, every code point up to 1114111 (U+10FFFF) is recognized as one Unicode character, so the Unicode string u'\U0001d300' consists of exactly one "Unicode character"/Unicode code point.

Python 2.6.6 (r266:84292, May  1 2012, 13:52:17)
[GCC 4.4.6 20110731 (Red Hat 4.4.6-3)] on linux2
>>> import sys; sys.maxunicode
1114111
>>> len(u'\U0001d300')
1
>>> [hex(ord(i)) for i in u'\U0001d300']
['0x1d300']
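
If you need to tell the two builds apart from inside a program, the sys.maxunicode check used in both sessions can be wrapped in a small helper. This is only a sketch; the function name is_narrow_build is an assumption of mine:

```python
import sys

def is_narrow_build():
    """Return True on a UCS-2 ("narrow") interpreter, False otherwise."""
    return sys.maxunicode == 0xFFFF

# One astral character takes two string elements on a narrow build, one otherwise.
expected_len = 2 if is_narrow_build() else 1
print(is_narrow_build(), len(u'\U0001d300') == expected_len)
```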

[1] I use "Unicode character" (in quotes) to refer to one character in a Python Unicode string, not one Unicode code point. The number of "Unicode characters" in a string is the len() of the string. In a "narrow" build, one "Unicode character" is a 16-bit code unit of UTF-16, so one astral character will appear as two "Unicode characters". In a "wide" build, one "Unicode character" always corresponds to one Unicode code point.

Matching astral plane characters with regex

Wide build

The regex in the question compiles correctly in "wide" build:

Python 2.6.6 (r266:84292, May  1 2012, 13:52:17)
[GCC 4.4.6 20110731 (Red Hat 4.4.6-3)] on linux2
>>> import re; re.compile(u'[\U0001d300-\U0001d356]', re.DEBUG)
in
  range (119552, 119638)
<_sre.SRE_Pattern object at 0x7f9f110386b8>

Narrow build

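However, the same regex won't work in a "narrow" build, since the engine does not recognize surrogate pairs. It treats the first \ud834 as a single character in the class, then tries to create a character range from \udf00 to the second \ud834 and fails, because the start of the range (0xdf00) is greater than its end (0xd834).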

Python 2.7.8 (default, Jul 25 2014, 14:04:36)
[GCC 4.8.3] on cygwin
>>> [hex(ord(i)) for i in u'[\U0001d300-\U0001d356]']
['0x5b', '0xd834', '0xdf00', '0x2d', '0xd834', '0xdf56', '0x5d']
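
The failure can be reproduced without a narrow build by writing the surrogates out by hand, which is how a narrow build stores the pattern. A sketch, assuming Python 3.3+ (where lone surrogate escapes in a string literal stay separate); the helper name compiles is mine:

```python
import re

# The pattern from the question as a narrow build sees it:
# '[', 0xd834, 0xdf00, '-', 0xd834, 0xdf56, ']'
narrow_view = u'[\ud834\udf00-\ud834\udf56]'

def compiles(pattern):
    """Return True if re accepts the pattern, False if it raises re.error."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

print(compiles(narrow_view))               # contains the reversed range 0xdf00-0xd834
print(compiles(u'\ud834[\udf00-\udf56]'))  # contains the valid range 0xdf00-0xdf56
```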

The workaround is the same method used in ECMAScript: construct the regex so that it matches the surrogates representing the code point.

Python 2.7.8 (default, Jul 25 2014, 14:04:36)
[GCC 4.8.3] on cygwin
>>> import re; re.compile(u'\ud834[\udf00-\udf56]', re.DEBUG)
literal 55348
in
  range (57088, 57174)
<_sre.SRE_Pattern object at 0x6ffffe52210>
>>> input =  u'Sample \U0001d340. Another \U0001d305. Leave alone \U00011000'
>>> input
u'Sample \U0001d340. Another \U0001d305. Leave alone \U00011000'
>>> re.sub(u'\ud834[\udf00-\udf56]', '', input)
u'Sample . Another . Leave alone \U00011000'
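
If the same code has to run on both kinds of build, one option is to choose the pattern once at import time. This is a sketch under that assumption, not a recommendation over switching builds; the names ASTRAL_RANGE and strip_range are hypothetical:

```python
import re
import sys

if sys.maxunicode > 0xFFFF:
    # Wide build or Python 3.3+: one string element per code point.
    ASTRAL_RANGE = re.compile(u'[\U0001d300-\U0001d356]')
else:
    # Narrow build: match the surrogate pair explicitly.
    ASTRAL_RANGE = re.compile(u'\ud834[\udf00-\udf56]')

def strip_range(text):
    """Remove every character in U+1D300..U+1D356 from text."""
    return ASTRAL_RANGE.sub(u'', text)

sample = u'Sample \U0001d340. Another \U0001d305. Leave alone \U00011000'
print(strip_range(sample) == u'Sample . Another . Leave alone \U00011000')
```

Only one branch is compiled, so the pattern that would raise "bad character range" on a narrow build is never handed to re.compile there.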

Using regexpu to derive astral plane regex for Python narrow build

Since the construction to match astral plane characters in a Python narrow build is the same as in ES5, you can use regexpu, a tool that converts ES6 RegExp literals to ES5, to do the conversion for you.

Just paste the equivalent regex in ES6 (note the u flag and \u{hh...h} syntax):

/[\u{1d300}-\u{1d356}]/u

and you get back a regex that can be used in both a Python narrow build and ES5:

/(?:\uD834[\uDF00-\uDF56])/

Remember to remove the / delimiters of the JavaScript RegExp literal when you use the regex in Python.

The tool is especially useful when the range spreads across multiple high surrogates (U+D800 to U+DBFF). For example, if we have to match the character range

/[\u{105c0}-\u{1cb40}]/u

The equivalent regex in Python narrow build and ES5 is

/(?:\uD801[\uDDC0-\uDFFF]|[\uD802-\uD831][\uDC00-\uDFFF]|\uD832[\uDC00-\uDF40])/

which is rather complex and error-prone to derive.
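
For a single contiguous astral range, the same kind of expression can also be derived in Python. The following is a rough sketch, not regexpu's algorithm: the function names are mine, it handles only one range with both ends above U+FFFF, and its output is not guaranteed to be minimal or identical to regexpu's in every case.

```python
def surrogate_pair(code_point):
    """Return the (high, low) UTF-16 surrogates of an astral code point."""
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)

def _esc(unit):
    # Emit the code unit as the text of a \uXXXX escape.
    return '\\u%04X' % unit

def _unit_range(first, last):
    # A single escape if the range has one element, else a character class.
    if first == last:
        return _esc(first)
    return '[%s-%s]' % (_esc(first), _esc(last))

def astral_range_pattern(start, end):
    """Build surrogate-pair regex text for the code points start..end."""
    if not 0x10000 <= start <= end <= 0x10FFFF:
        raise ValueError("expected an astral range with start <= end")
    high1, low1 = surrogate_pair(start)
    high2, low2 = surrogate_pair(end)
    if high1 == high2:
        parts = [_esc(high1) + _unit_range(low1, low2)]
    else:
        # First high surrogate: from low1 up to the last low surrogate.
        parts = [_esc(high1) + _unit_range(low1, 0xDFFF)]
        # High surrogates strictly in between: any low surrogate.
        if high2 - high1 > 1:
            parts.append(_unit_range(high1 + 1, high2 - 1)
                         + _unit_range(0xDC00, 0xDFFF))
        # Last high surrogate: from the first low surrogate up to low2.
        parts.append(_esc(high2) + _unit_range(0xDC00, low2))
    return '(?:' + '|'.join(parts) + ')'

print(astral_range_pattern(0x105C0, 0x1CB40))
```

The function returns the pattern as escape text, so for a Python 2 narrow build the result still has to be pasted into a u'...' literal, where the \uXXXX escapes are resolved by the string literal itself.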

Python version 3.3 and above

Python 3.3 implements PEP 393, which eliminates the distinction between narrow and wide builds; from that version on, Python always behaves like a wide build. This eliminates the problem in the question altogether.
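
As a quick check, assuming the interpreter is Python 3.3 or later, the character class from the question can be used unchanged:

```python
import re
import sys

symbol = '\U0001d300'

# On Python 3.3+ every interpreter reports the full Unicode range.
print(sys.maxunicode == 0x10FFFF)

# One astral character is one string element, with its real code point.
print(len(symbol), hex(ord(symbol)))

# The pattern from the question compiles and matches as written.
pattern = re.compile('[\U0001d300-\U0001d356]')
cleaned = pattern.sub('', 'Sample \U0001d340.')
print(cleaned)
```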

Compatibility issues

While it's possible to work around the limitation and match astral plane characters in Python narrow builds, going forward it's best to change the execution environment to a Python wide build, or to port the code to Python 3.3 or above.

The regex code for narrow build is hard to read and maintain for average programmers, and it has to be completely rewritten when porting to Python 3.

Reference

  • How to find out if Python is compiled with UCS-2 or UCS-4?
nhahtdh answered Oct 21 '22 05:10