
How do I specify a range of unicode characters

How do I specify a range of unicode characters from ' ' (space) to \u00D7FF?

I have a regular expression like r'[\u0020-\u00D7FF]' and it won't compile, saying that it's a bad range. I am new to Unicode regular expressions, so I haven't had this problem before.

Is there a way to make this compile or a regular expression that I'm forgetting or haven't learned yet?

spig asked Oct 01 '10 01:10


1 Answer

The syntax of your unicode range will not do what you expect.

  1. The raw r'' string prevents \u escapes from being parsed by Python, and Python 2's regex engine does not interpret them either. The only range in this set is 0-\u, that is, '0' (48) through a literal 'u' (117):

    >>> re.compile(r'[\u0020-\u00d7ff]', re.DEBUG)
    in
      literal 117
      literal 48
      literal 48
      literal 50
      range (48, 117)
      literal 48
      literal 48
      literal 100
      literal 55
      literal 102
      literal 102
  2. Making it a raw Unicode literal (ur'') causes \u parsing while leaving other backslashes alone (although that's not a concern here), but the extra leading zeroes break it. \u takes exactly four hex digits and \U exactly eight, so \u00d7ff is parsed as "\u00d7, f, f".

    >>> re.compile(ur'[\u0020-\u00d7ff]', re.DEBUG)
    in
      range (32, 215)
      literal 102
      literal 102
  3. Removing the leading zeroes or switching to \U0000d7ff will fix it:

    >>> re.compile(ur'[\u0020-\ud7ff]', re.DEBUG)
    in
      range (32, 55295)
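
The sessions above are Python 2, where the ur'' prefix exists. As a rough sketch for Python 3.3 and later, where the re module itself understands \uXXXX and \UXXXXXXXX escapes and ur'' is no longer valid syntax, the same fix might look like the following. The names pattern, too_many_digits and in_range are made up for illustration and are not part of the original answer.

```python
import re

# Assumes Python 3.3+: a plain raw string is enough, because the regex
# engine parses \uXXXX (four hex digits) and \UXXXXXXXX (eight) itself.
pattern = re.compile(r'[\u0020-\ud7ff]')

# Equivalent spelling using the eight-digit form.
pattern_long = re.compile(r'[\U00000020-\U0000d7ff]')

# \u still consumes exactly four hex digits, so this class is the range
# U+0020..U+00D7 followed by two literal 'f' characters.
too_many_digits = re.compile(r'[\u0020-\u00d7ff]')


def in_range(ch):
    """Return True if the single character ch lies in U+0020..U+D7FF."""
    return pattern.fullmatch(ch) is not None
```

With this sketch, in_range(' ') and in_range('\ud7ff') should both be true, while a control character such as '\x1f' or a character above the range such as '\ue000' should be rejected.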
Josh Lee answered Oct 05 '22 23:10