How do I specify a range of Unicode characters from ' ' (space) to \u00D7FF?
I have a regular expression like r'[\u0020-\u00D7FF]' and it won't compile, saying that it's a bad range. I am new to Unicode regular expressions, so I haven't had this problem before.
Is there a way to make this compile or a regular expression that I'm forgetting or haven't learned yet?
As of Unicode version 14.0, there are 144,697 characters with code points, covering 159 modern and historical scripts, as well as multiple symbol sets.
Values. Any unicode character code or range is an acceptable unicode-range value. You will notice that unicode points are preceded by a U+ followed by up to six characters that make up the character code. Points or ranges that do not follow this format are considered invalid and will cause the property to be ignored.
To show a range of characters, use square brackets and separate the starting character from the ending character with a hyphen. For example, [0-9] matches any digit. Several ranges can be put inside square brackets. For example, [A-CX-Z] matches 'A' or 'B' or 'C' or 'X' or 'Y' or 'Z'.
To be specific: a range is a contiguous series of characters, from low to high, in the ASCII character set. For example, [z-a] is not a range because it's backwards.
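A minimal sketch of these two rules using Python's re module. The sample string is purely illustrative, and the exact wording of the error message may differ between Python versions:

```python
import re

# Several ranges can share one pair of square brackets.
pattern = re.compile(r'[A-CX-Z]')
matches = [ch for ch in 'ABCDWXYZ' if pattern.fullmatch(ch)]
print(matches)  # ['A', 'B', 'C', 'X', 'Y', 'Z']

# A backwards range is rejected when the pattern is compiled.
try:
    re.compile(r'[z-a]')
    backwards_ok = True
except re.error as exc:
    backwards_ok = False
    print('rejected:', exc)
```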
A range of Unicode code points. So for example, U+0025-00FF means include all characters in the range U+0025 to U+00FF. A range of Unicode code points containing wildcard characters, that is using the '?' character, so for example U+4?? means include all characters in the range U+400 to U+4FF.
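As a rough illustration of how the interval and wildcard forms map onto numeric bounds, here is a small Python sketch. The helper name expand_unicode_range is hypothetical and handles only a single token in the two forms described above; it is not a full parser for the property.

```python
def expand_unicode_range(value):
    """Return (low, high) code points for one unicode-range style token.

    Hypothetical helper for illustration only.
    """
    body = value.strip()
    if not body.upper().startswith('U+'):
        raise ValueError('expected a U+ prefix: %r' % value)
    body = body[2:]
    if '-' in body:
        # Explicit interval such as U+0025-00FF.
        low, high = body.split('-', 1)
        return int(low, 16), int(high, 16)
    # Each '?' stands for any hex digit: 0 gives the low end, F the high end.
    return int(body.replace('?', '0'), 16), int(body.replace('?', 'F'), 16)

print(expand_unicode_range('U+0025-00FF'))  # (37, 255)
print(expand_unicode_range('U+4??'))        # (1024, 1279)
```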
You can adjust the interval of generated Unicode characters by specifying three parameters for it: the starting code point, the increment, and the count. The starting code point (in hex format) sets the first Unicode character of the range. The increment (also in hex format) is the difference between consecutive code points in the generated range.
Each Unicode character has a unique numerical value called a code point or a code position. There are 1,114,112 code positions, in the interval from 0x0 to 0x10FFFF (in base 16).
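The starting code point, increment, and count parameters described above can be sketched in a few lines of Python. The function name generate_range is an assumption made for this example and does not come from any particular tool:

```python
import sys

def generate_range(start, increment, count):
    """Return `count` characters beginning at `start`, stepping by `increment`."""
    return [chr(start + i * increment) for i in range(count)]

# The highest code point is 0x10FFFF, which gives 0x10FFFF + 1 code positions.
print(sys.maxunicode == 0x10FFFF)     # True
print(0x10FFFF + 1)                   # 1114112

print(generate_range(0x41, 0x1, 5))   # ['A', 'B', 'C', 'D', 'E']
print(generate_range(0x3B1, 0x2, 3))  # ['α', 'γ', 'ε']
```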
The syntax of your Unicode range will not do what you expect. The raw r'' string prevents \u escapes from being parsed, and Python 2's regex engine does not parse them either. The only range in this set is 0-\u (that is, '0' to 'u', code points 48 to 117):
>>> re.compile(r'[\u0020-\u00d7ff]', re.DEBUG)
in
  literal 117
  literal 48
  literal 48
  literal 50
  range (48, 117)
  literal 48
  literal 48
  literal 100
  literal 55
  literal 102
  literal 102
Making it a Unicode literal causes \u parsing while leaving other backslashes alone (although that's not a concern here), but the leading zeroes are messing it up. The syntax is \uxxxx or \Uxxxxxxxx, so it's parsed as "\u00d7, f, f".
>>> re.compile(ur'[\u0020-\u00d7ff]', re.DEBUG)
in
  range (32, 215)
  literal 102
  literal 102
Removing the leading zeroes or switching to \U0000d7ff will fix it:
>>> re.compile(ur'[\u0020-\ud7ff]', re.DEBUG)
in
  range (32, 55295)
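The sessions above use Python 2 syntax; the ur'' prefix does not exist in Python 3. As a rough sketch, assuming Python 3.3 or later, where the re module itself understands \uXXXX and \UXXXXXXXX escapes in a str pattern, the same pitfall and fix might look like this (the test characters are arbitrary picks from inside and outside the intended range):

```python
import re

# Too many digits: parsed as \u00D7 followed by two literal 'F' characters,
# so the range silently ends at U+00D7 instead of U+D7FF.
too_long = re.compile(r'[\u0020-\u00D7FF]')
print(bool(too_long.fullmatch('\u0100')))  # False

# Exactly four hex digits after \u, or exactly eight after \U.
fixed = re.compile(r'[\u0020-\uD7FF]')
wide = re.compile(r'[\u0020-\U0000D7FF]')

for ch in (' ', '\u0100', '\ud7ff'):
    assert fixed.fullmatch(ch) and wide.fullmatch(ch)

# U+001F sits just below the start of the range.
print(bool(fixed.fullmatch('\x1f')))  # False
```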