Is there a better way to do this?
$ python
Python 2.7.9 (default, Jul 16 2015, 14:54:10)
[GCC 4.1.2 20080704 (Red Hat 4.1.2-55)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> re.sub(u'[\U0001d300-\U0001d356]', "", "")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/fast/services/lib/python2.7/re.py", line 155, in sub
    return _compile(pattern, flags).sub(repl, string, count)
  File "/home/fast/services/lib/python2.7/re.py", line 251, in _compile
    raise error, v # invalid expression
sre_constants.error: bad character range
The error suggests that you are using a "narrow" (UCS-2) build, which treats only code points up to 65535 (U+FFFF) as a single "Unicode character" [1]. Code points above 65535 are represented as surrogate pairs, which means that the Unicode string u'\U0001d300' consists of two "Unicode characters" in a narrow build:
Python 2.7.8 (default, Jul 25 2014, 14:04:36)
[GCC 4.8.3] on cygwin
>>> import sys; sys.maxunicode
65535
>>> len(u'\U0001d300')
2
>>> [hex(ord(i)) for i in u'\U0001d300']
['0xd834', '0xdf00']
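The two values above follow from the standard UTF-16 surrogate arithmetic. Here is a minimal sketch of that arithmetic; the function names to_surrogate_pair and from_surrogate_pair are made up for illustration and are not part of Python's standard library:

```python
def to_surrogate_pair(code_point):
    """Split an astral code point (U+10000..U+10FFFF) into UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not an astral code point: %#x" % code_point)
    offset = code_point - 0x10000      # 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits select the high surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits select the low surrogate
    return high, low


def from_surrogate_pair(high, low):
    """Recombine a high/low surrogate pair into the original code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogate_pair(0x1D300)
print("%#x %#x" % (high, low))  # prints: 0xd834 0xdf00
```

This matches the pair that the narrow build reports for u'\U0001d300'.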
In a "wide" (UCS-4) build, every code point up to 1114111 (U+10FFFF) is recognized as a single "Unicode character", so the Unicode string u'\U0001d300' consists of exactly one "Unicode character"/Unicode code point:
Python 2.6.6 (r266:84292, May 1 2012, 13:52:17)
[GCC 4.4.6 20110731 (Red Hat 4.4.6-3)] on linux2
>>> import sys; sys.maxunicode
1114111
>>> len(u'\U0001d300')
1
>>> [hex(ord(i)) for i in u'\U0001d300']
['0x1d300']
[1] I use "Unicode character" (in quotes) to refer to one element of a Python Unicode string, not one Unicode code point. The number of "Unicode characters" in a string is the len() of the string. In a "narrow" build, one "Unicode character" is a 16-bit UTF-16 code unit, so one astral character appears as two "Unicode characters". In a "wide" build, one "Unicode character" always corresponds to one Unicode code point.
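If you need the number of code points rather than len() on a narrow build, one option is to count a high surrogate followed by a low surrogate as a single unit. The helper below is only a sketch, and count_code_points is a hypothetical name; it gives the same answer on narrow and wide builds:

```python
def count_code_points(text):
    """Count code points, treating a high+low surrogate pair as one."""
    count = 0
    i = 0
    while i < len(text):
        unit = ord(text[i])
        if (0xD800 <= unit <= 0xDBFF and i + 1 < len(text)
                and 0xDC00 <= ord(text[i + 1]) <= 0xDFFF):
            i += 2  # surrogate pair: two code units, one code point
        else:
            i += 1  # BMP character, lone surrogate, or a wide-build code point
        count += 1
    return count


print(count_code_points(u'Sample \U0001d340'))  # prints: 8
```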
The regex in the question compiles correctly in a "wide" build:
Python 2.6.6 (r266:84292, May 1 2012, 13:52:17)
[GCC 4.4.6 20110731 (Red Hat 4.4.6-3)] on linux2
>>> import re; re.compile(u'[\U0001d300-\U0001d356]', re.DEBUG)
in
  range (119552, 119638)
<_sre.SRE_Pattern object at 0x7f9f110386b8>
However, the same regex won't work in a "narrow" build, since the engine does not recognize surrogate pairs. It treats \ud834 as a single character, then tries to create a character range from \udf00 to \ud834 and fails, because the start of that range is greater than its end.
Python 2.7.8 (default, Jul 25 2014, 14:04:36)
[GCC 4.8.3] on cygwin
>>> [hex(ord(i)) for i in u'[\U0001d300-\U0001d356]']
['0x5b', '0xd834', '0xdf00', '0x2d', '0xd834', '0xdf56', '0x5d']
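To see why the range is rejected without having a narrow build at hand, you can look at the UTF-16 code units of the pattern, which is roughly what a narrow build stores. This is a sketch under that assumption; utf16_code_units is a hypothetical helper name:

```python
import struct


def utf16_code_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode('utf-16-be')
    return list(struct.unpack('>%dH' % (len(data) // 2), data))


units = utf16_code_units(u'[\U0001d300-\U0001d356]')

# The regex engine pairs the units on either side of the '-' as a range.
dash = units.index(ord('-'))
range_start, range_end = units[dash - 1], units[dash + 1]

print("%#x-%#x" % (range_start, range_end))  # prints: 0xdf00-0xd834
print(range_start > range_end)               # prints: True
```

A range whose start is greater than its end is exactly what "bad character range" complains about.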
The workaround is the same one used in ECMAScript (ES5): construct the regex so that it matches the surrogate pairs representing the code points.
Python 2.7.8 (default, Jul 25 2014, 14:04:36)
[GCC 4.8.3] on cygwin
>>> import re; re.compile(u'\ud834[\udf00-\udf56]', re.DEBUG)
literal 55348
in
  range (57088, 57174)
<_sre.SRE_Pattern object at 0x6ffffe52210>
>>> input = u'Sample \U0001d340. Another \U0001d305. Leave alone \U00011000'
>>> input
u'Sample \U0001d340. Another \U0001d305. Leave alone \U00011000'
>>> re.sub(u'\ud834[\udf00-\udf56]', '', input)
u'Sample . Another . Leave alone \U00011000'
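If the same code has to run on both kinds of build, one possible approach is to pick the pattern at import time based on sys.maxunicode. This is only a sketch; ASTRAL_RANGE and strip_range are names invented for the example:

```python
import re
import sys

if sys.maxunicode > 0xFFFF:
    # Wide build (or Python 3.3+): the range can be written directly.
    ASTRAL_RANGE = re.compile(u'[\U0001d300-\U0001d356]')
else:
    # Narrow build: match the high surrogate, then the low-surrogate range.
    ASTRAL_RANGE = re.compile(u'\ud834[\udf00-\udf56]')


def strip_range(text):
    """Remove every character in U+1D300..U+1D356 from text."""
    return ASTRAL_RANGE.sub(u'', text)


sample = u'Sample \U0001d340. Another \U0001d305. Leave alone \U00011000'
cleaned = strip_range(sample)
```

Only the branch that matches the running build is compiled, so the direct range never reaches the regex engine on a narrow build.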
Since the construction for matching astral plane characters in a Python narrow build is the same as in ES5, you can use regexpu, a tool that converts ES6 RegExp literals to ES5, to do the conversion for you.
Just paste the equivalent ES6 regex (note the u flag and the \u{hh...h} syntax):
/[\u{1d300}-\u{1d356}]/u
and you get back a regex that can be used in both a Python narrow build and ES5:
/(?:\uD834[\uDF00-\uDF56])/
Remember to remove the / delimiters of the JavaScript RegExp literal when you use the regex in Python.
The tool is especially useful when the range spans multiple high surrogates (U+D800 to U+DBFF). For example, if we have to match the character range
/[\u{105c0}-\u{1cb40}]/u
the equivalent regex for a Python narrow build and ES5 is
/(?:\uD801[\uDDC0-\uDFFF]|[\uD802-\uD831][\uDC00-\uDFFF]|\uD832[\uDC00-\uDF40])/
which is rather complex and error-prone to derive.
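For a single contiguous astral range, the decomposition can also be derived mechanically. The sketch below is not regexpu's actual algorithm, just one way to reproduce its output for this simple case; astral_range_to_surrogates and the underscore-prefixed helpers are hypothetical names. It returns the pattern as escape text, meant to be pasted into a u'...' literal:

```python
def _pair(code_point):
    """UTF-16 surrogate pair (high, low) for an astral code point."""
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


def _esc(unit):
    """Textual \\uXXXX escape for one 16-bit code unit."""
    return '\\u%04X' % unit


def _cls(first, last):
    """A single escape, or a character class covering first..last."""
    if first == last:
        return _esc(first)
    return '[%s-%s]' % (_esc(first), _esc(last))


def astral_range_to_surrogates(first, last):
    """Escape text matching U+first..U+last as surrogate pairs."""
    if not 0x10000 <= first <= last <= 0x10FFFF:
        raise ValueError('expected an ascending astral range')
    h1, l1 = _pair(first)
    h2, l2 = _pair(last)
    if h1 == h2:
        # Whole range shares one high surrogate.
        return '(?:%s%s)' % (_esc(h1), _cls(l1, l2))
    parts = []
    full_first, full_last = h1, h2
    if l1 != 0xDC00:
        # Partial block at the start of the range.
        parts.append(_esc(h1) + _cls(l1, 0xDFFF))
        full_first = h1 + 1
    tail = None
    if l2 != 0xDFFF:
        # Partial block at the end of the range.
        tail = _esc(h2) + _cls(0xDC00, l2)
        full_last = h2 - 1
    if full_first <= full_last:
        # High surrogates whose entire low-surrogate block is covered.
        parts.append(_cls(full_first, full_last) + _cls(0xDC00, 0xDFFF))
    if tail is not None:
        parts.append(tail)
    return '(?:%s)' % '|'.join(parts)


print(astral_range_to_surrogates(0x1D300, 0x1D356))
print(astral_range_to_surrogates(0x105C0, 0x1CB40))
```

For the two ranges discussed here, the printed text is the same as the regexpu output shown above, minus the / delimiters.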
Python 3.3 implements PEP 393, which eliminates the distinction between narrow and wide builds; from that version on, Python always behaves like a wide build. This eliminates the problem in the question altogether.
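As a quick check, assuming Python 3.3 or later, the original pattern from the question works unchanged:

```python
import re
import sys

text = 'Sample \U0001d340. Another \U0001d305. Leave alone \U00011000'

# The direct range compiles because every code point is one string element.
pattern = re.compile('[\U0001d300-\U0001d356]')
cleaned = pattern.sub('', text)

print(sys.maxunicode == 0x10FFFF, len('\U0001d300'))  # prints: True 1
```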
While it is possible to work around the problem and match astral plane characters in Python narrow builds, going forward it is best to change the execution environment by using a Python wide build, or to port the code to Python 3.3 or later.
The regex for narrow builds is hard to read and maintain, and it has to be completely rewritten when porting to Python 3.