A Python regex that matches the regional indicator character class

Tags:

python

regex

unicode

python-2.x

regional

I am using Python 2.7.10 on a Mac. Flags in emoji are indicated by a pair of Regional Indicator Symbols. I would like to write a Python regex to insert spaces between the flags in a string of emoji flags.

For example, this string is two Brazilian flags:
- u"\U0001F1E7\U0001F1F7\U0001F1E7\U0001F1F7"
- which will render like this: 🇧🇷🇧🇷

I'd like to insert spaces between any pair of regional indicator symbols. Something like this:

re.sub(re.compile(u"([\U0001F1E6-\U0001F1FF][\U0001F1E6-\U0001F1FF])"),
       r"\1 ", 
       u"\U0001F1E7\U0001F1F7\U0001F1E7\U0001F1F7")

...which would result in:

u"\U0001F1E7\U0001F1F7 \U0001F1E7\U0001F1F7 "

...but that code gives me an error:

sre_constants.error: bad character range

A hint (I think) at what's going wrong is the following, which shows that \U0001F1E7 is turning into two "characters" in the regex:

re.search(re.compile(u"([\U0001F1E7])"),
          u"\U0001F1E7\U0001F1F7\U0001F1E7\U0001F1F7").group(0)

This results in:

u'\ud83c'

Sadly, my understanding of Unicode is too weak for me to make further progress.

asked Aug 23 '16 18:08

John Rauser

1 Answer

I believe you're using Python 2.7 on Windows or Mac, which is a narrow (16-bit) Unicode build. Linux builds of Python 2 are usually wide (32-bit, full Unicode), and Python 3.3 and later handle the full Unicode range on all platforms.

What you see is the single code point being split into a surrogate pair. Unfortunately, this also means that you cannot easily use a single character class for this task. However, it is still possible. The UTF-16 representation of U+1F1E6 (🇦) is \uD83C\uDDE6, and that of U+1F1FF (🇿) is \uD83C\uDDFF.
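The surrogate values above follow from the standard UTF-16 arithmetic. As a minimal sketch (the function name `to_surrogate_pair` is made up for illustration, not part of any library), the split can be computed like this:

```python
def to_surrogate_pair(code_point):
    """Return the (high, low) UTF-16 surrogates for a code point above U+FFFF."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not in a supplementary plane")
    offset = code_point - 0x10000          # 20-bit value to split in two
    high = 0xD800 + (offset >> 10)         # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits go into the low surrogate
    return high, low


# First and last Regional Indicator Symbols
first = to_surrogate_pair(0x1F1E6)   # (0xD83C, 0xDDE6)
last = to_surrogate_pair(0x1F1FF)    # (0xD83C, 0xDDFF)
```

Both ends of the range share the same high surrogate, \uD83C, which is why only the low surrogate needs a character class.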

I do not have access to such a Python build, but you could try

\uD83C[\uDDE6-\uDDFF]

as a replacement for each single [\U0001F1E6-\U0001F1FF], so your whole regex would be

(\uD83C[\uDDE6-\uDDFF]\uD83C[\uDDE6-\uDDFF])

The reason the original character class doesn't work is that, on a narrow build, it tries to make a range from the second half of the first surrogate pair (\uDDE6) to the first half of the second surrogate pair (\uD83C). This fails because the start of the range is numerically greater than the end.
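This explanation can be checked on any build by compiling the reversed range explicitly. The following is only a sketch; `compiles` is a hypothetical helper, and the first pattern spells out the range a narrow build ends up seeing inside the brackets:

```python
import re


def compiles(pattern):
    """Return True if the pattern compiles, False if the re module rejects it."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# Range from a low surrogate (0xDDE6) down to a high surrogate (0xD83C):
# the start is greater than the end, so this is a bad character range.
reversed_range = u"[\uDDE6-\uD83C]"

# High surrogate outside the class, low surrogates in ascending order inside it.
surrogate_range = u"\uD83C[\uDDE6-\uDDFF]"

reversed_ok = compiles(reversed_range)
surrogate_ok = compiles(surrogate_range)
```

Here `reversed_ok` should come out false and `surrogate_ok` true, matching the "bad character range" error from the question.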

However, this regular expression will not work on Linux. There you need to use the original, since Linux builds use wide Unicode by default.
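If the same code has to run on both kinds of build, one option is to pick the pattern at import time based on `sys.maxunicode`. This is a sketch rather than tested advice for every build: I could not try it on a narrow build, and the names `REGIONAL_INDICATOR`, `FLAG_RE` and `space_out_flags` are invented for the example.

```python
import re
import sys

if sys.maxunicode > 0xFFFF:
    # Wide build (or Python 3.3+): each regional indicator is one character.
    REGIONAL_INDICATOR = u"[\U0001F1E6-\U0001F1FF]"
else:
    # Narrow build: each regional indicator is a surrogate pair.
    REGIONAL_INDICATOR = u"\uD83C[\uDDE6-\uDDFF]"

# A flag is two consecutive regional indicators.
FLAG_RE = re.compile(u"(" + REGIONAL_INDICATOR + REGIONAL_INDICATOR + u")")


def space_out_flags(text):
    """Append a space after every flag (pair of regional indicators) in text."""
    return FLAG_RE.sub(u"\\1 ", text)


two_flags = u"\U0001F1E7\U0001F1F7\U0001F1E7\U0001F1F7"
spaced = space_out_flags(two_flags)
```

On a wide build, `spaced` is the two Brazilian flags each followed by a space, which is the output the question asks for.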

Alternatively, upgrade your Python to 3.5 or later.

answered Sep 22 '22 13:09

Antti Haapala -- Слава Україні
