How do I specify a range of Unicode characters from <code>' '</code> (space) to <code>\u00D7FF</code>? I have a regular expression like <code>r'[\u0020-\u00D7FF]'</code>, and it won't compile; the error says it's a bad range. I am new to Unicode regular expressions, so I haven't run into this before. Is there a way to make this compile, or is there regular expression syntax that I'm forgetting or haven't learned yet?

The syntax of your Unicode range will not do what you expect.
<ol>
<li> The raw <code>r''</code> string prevents Python from parsing the <code>\u</code> escapes, and the regex engine does not parse them either. The only range in this set is <code>[0-\u]</code>, that is, <code>0</code> (48) through <code>u</code> (117):
<pre class="prettyprint"><code>>>> re.compile(r'[\u0020-\u00d7ff]', re.DEBUG)
in
  literal 117
  literal 48
  literal 48
  literal 50
  range (48, 117)
  literal 48
  literal 48
  literal 100
  literal 55
  literal 102
  literal 102
</code></pre>
</li>
<li> Making it a Unicode literal causes <code>\u</code> parsing while leaving other backslashes alone (although that’s not a concern here), but the extra leading zeroes break it. The syntax is <code>\uxxxx</code> (exactly four hex digits) or <code>\Uxxxxxxxx</code> (exactly eight), so it’s parsed as "<code>\u00d7</code>, <code>f</code>, <code>f</code>".
<pre class="prettyprint"><code>>>> re.compile(ur'[\u0020-\u00d7ff]', re.DEBUG)
in
  range (32, 215)
  literal 102
  literal 102
</code></pre>
</li>
<li> Removing the extra leading zeroes or switching to <code>\U0000d7ff</code> fixes it:
<pre class="prettyprint"><code>>>> re.compile(ur'[\u0020-\ud7ff]', re.DEBUG)
in
  range (32, 55295)
</code></pre>
</li>
</ol>
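The transcripts above use the <code>ur''</code> prefix, which is Python 2 syntax and a syntax error in Python 3. As a minimal sketch, assuming Python 3.3 or later (where the <code>re</code> module parses <code>\uxxxx</code> and <code>\Uxxxxxxxx</code> itself), the same range can be written with a plain raw string. The helper <code>in_class</code> and the variable names are hypothetical, introduced only for illustration:

```python
import re

# Python 3.3+: the re module understands \uXXXX and \UXXXXXXXX on its own,
# so a plain raw string works and no ur'' prefix is needed.
printable = re.compile(r'[\u0020-\ud7ff]')

# The eight-digit form of the same range.
printable_long = re.compile(r'[\U00000020-\U0000d7ff]')

# With the extra zeroes, the range ends at U+00D7 and "ff" are two literals.
too_many_zeroes = re.compile(r'[\u0020-\u00d7ff]')


def in_class(pattern, ch):
    """Return True if the single character ch matches the character class."""
    return pattern.fullmatch(ch) is not None


print(in_class(printable, ' '))                # True: lower bound U+0020
print(in_class(printable, chr(0xD7FF)))        # True: upper bound U+D7FF
print(in_class(printable, chr(0x1F)))          # False: below the range
print(in_class(printable, chr(0xE000)))        # False: above the range
print(in_class(too_many_zeroes, chr(0x100)))   # False: that range stops at U+00D7
```

The last line shows why the extra zeroes matter: the pattern still compiles, but it silently covers a much smaller range than intended.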


answered Oct 05 '22 23:10

Josh Lee
