
A python regex that matches the regional indicator character class

I am using Python 2.7.10 on a Mac. Flags in emoji are encoded as a pair of Regional Indicator Symbols. I would like to write a Python regex that inserts spaces between the flags in a string of emoji flags.

  • For example, this string is two Brazilian flags:

    • u"\U0001F1E7\U0001F1F7\U0001F1E7\U0001F1F7"

    • which will render like this: 🇧🇷🇧🇷

I'd like to insert a space after each pair of regional indicator symbols. Something like this:

re.sub(re.compile(u"([\U0001F1E6-\U0001F1FF][\U0001F1E6-\U0001F1FF])"),
       r"\1 ", 
       u"\U0001F1E7\U0001F1F7\U0001F1E7\U0001F1F7")

...which would result in:

u"\U0001F1E7\U0001F1F7 \U0001F1E7\U0001F1F7 "

...but that code gives me an error:

sre_constants.error: bad character range

A hint (I think) at what's going wrong is the following, which shows that \U0001F1E7 is turning into two "characters" in the regex:

re.search(re.compile(u"([\U0001F1E7])"),
          u"\U0001F1E7\U0001F1F7\U0001F1E7\U0001F1F7").group(0)

This results in:

u'\ud83c'

Sadly, my understanding of Unicode is too weak for me to make further progress.

John Rauser asked Aug 23 '16 18:08



1 Answer

I believe you're using Python 2.7 on Windows or Mac, which is a narrow Unicode build that stores strings as 16-bit code units. Linux/glibc builds of Python 2 usually use wide (32-bit) Unicode, and Python 3.3 and later, including 3.5, handle the full Unicode range on all platforms.

What you see is a single code point being split into a surrogate pair. Unfortunately, it also means that you cannot easily use a single character class for this task. It is still possible, though. The UTF-16 representation of U+1F1E6 (🇦) is \uD83C\uDDE6, and that of U+1F1FF (🇿) is \uD83C\uDDFF.
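As a quick check of those two values, here is a minimal sketch of the standard UTF-16 surrogate arithmetic. The helper name `to_surrogates` is made up for illustration; it is not part of any library.

```python
def to_surrogates(code_point):
    """Split a code point above U+FFFF into its UTF-16 (high, low) surrogates."""
    offset = code_point - 0x10000          # 20-bit value to spread over two units
    high = 0xD800 + (offset >> 10)         # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits go into the low surrogate
    return high, low

first_high, first_low = to_surrogates(0x1F1E6)   # REGIONAL INDICATOR SYMBOL LETTER A
last_high, last_low = to_surrogates(0x1F1FF)     # REGIONAL INDICATOR SYMBOL LETTER Z
```

Both ends of the regional indicator block share the high surrogate 0xD83C, which is why only the low surrogate needs a range.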

I don't have access to such a Python build to test this, but you could try

\uD83C[\uDDE6-\uDDFF]

as a replacement for the single class [\U0001F1E6-\U0001F1FF], so your whole regex would be

(\uD83C[\uDDE6-\uDDFF]\uD83C[\uDDE6-\uDDFF])

The character class fails because, on a narrow build, [\U0001F1E6-\U0001F1FF] is really [\uD83C\uDDE6-\uD83C\uDDFF]. The regex engine reads the middle of that as a range from \uDDE6 (the second half of the first surrogate pair) to \uD83C (the first half of the second surrogate pair), and rejects it because the start of the range is greater than the end.
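The failure can be reproduced on any build by spelling out the surrogates by hand, which is roughly what a narrow build hands to the regex engine. This is only a sketch; the variable names are made up for illustration.

```python
import re

# What [\U0001F1E6-\U0001F1FF] looks like as 16-bit code units.
# The engine parses this as: \uD83C, then the range \uDDE6-\uD83C, then \uDDFF.
narrow_equivalent = u"[\uD83C\uDDE6-\uD83C\uDDFF]"

try:
    re.compile(narrow_equivalent)
    range_rejected = False
except re.error:
    # 0xDDE6 > 0xD83C, so the range runs backwards and is refused.
    range_rejected = True
```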

However, this surrogate-pair regular expression won't work on Linux. Use the original there, since Linux builds usually use wide Unicode by default.
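If the same script has to run on both kinds of build, one option is to choose the pattern at runtime from sys.maxunicode, which is 0xFFFF on narrow builds and 0x10FFFF otherwise. The sketch below assumes that approach. The names RI, FLAG and space_flags are made up for illustration, and the narrow branch simply follows the suggestion above, so treat it as untested on a real narrow build.

```python
import re
import sys

if sys.maxunicode > 0xFFFF:
    # Wide build, or Python 3.3+: each regional indicator is one code point.
    RI = u"[\U0001F1E6-\U0001F1FF]"
else:
    # Narrow build: each regional indicator is a surrogate pair (assumption,
    # per the suggestion above).
    RI = u"\uD83C[\uDDE6-\uDDFF]"

# A flag is two consecutive regional indicators.
FLAG = re.compile(u"(" + RI + RI + u")")

def space_flags(text):
    """Return text with a space appended after every flag."""
    return FLAG.sub(u"\\1 ", text)

two_flags = u"\U0001F1E7\U0001F1F7\U0001F1E7\U0001F1F7"
spaced = space_flags(two_flags)
```

As in the question's expected output, this leaves a trailing space after the last flag; spaced.rstrip() removes it if that is unwanted.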


Alternatively, upgrade your Python to 3.5 or later, where the original regex works as written.