I am using Python 2.7.10 on a Mac. Emoji flags are encoded as pairs of Regional Indicator Symbols. I would like to write a Python regex that inserts spaces between the flags in a string of emoji flags.
For example, this string is two Brazilian flags:
u"\U0001F1E7\U0001F1F7\U0001F1E7\U0001F1F7"
which will render like this: 🇧🇷🇧🇷
I'd like to insert a space after each pair of regional indicator symbols. Something like this:
re.sub(re.compile(u"([\U0001F1E6-\U0001F1FF][\U0001F1E6-\U0001F1FF])"),
r"\1 ",
u"\U0001F1E7\U0001F1F7\U0001F1E7\U0001F1F7")
...which would result in:
u"\U0001F1E7\U0001F1F7 \U0001F1E7\U0001F1F7 "
...but that code gives me an error:
sre_constants.error: bad character range
A hint (I think) at what's going wrong is the following, which shows that \U0001F1E7 is turning into two "characters" in the regex:
re.search(re.compile(u"([\U0001F1E7])"),
u"\U0001F1E7\U0001F1F7\U0001F1E7\U0001F1F7").group(0)
This results in:
u'\ud83c'
Sadly my understanding of unicode is too weak for me to make further progress.
I believe you're using Python 2.7 on Windows or Mac, which is a narrow (16-bit) Unicode build. Linux distributions usually ship a wide (32-bit) build with full Unicode support, and Python 3.3 and later (so also 3.5) handle full Unicode on all platforms.
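You can check which kind of build you have before going further. This is a small sketch, and the helper name is_narrow_build is just something I made up for illustration; sys.maxunicode itself is standard.

```python
import sys

def is_narrow_build():
    # Narrow builds cap sys.maxunicode at 0xFFFF; wide builds and
    # Python 3.3+ report 0x10FFFF.
    return sys.maxunicode == 0xFFFF

flag_half = u"\U0001F1E7"  # REGIONAL INDICATOR SYMBOL LETTER B

print(is_narrow_build())
# Length is 2 on a narrow build (stored as a surrogate pair), 1 otherwise.
print(len(flag_half))
```

If the first line printed is True, the surrogate pair problem described here applies to your interpreter.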
What you see is the one code point being split into a surrogate pair. Unfortunately, it also means that you cannot easily use a single character class for this task. However, it is still possible. The UTF-16 representation of U+1F1E6 (🇦) is \uD83C\uDDE6, and that of U+1F1FF (🇿) is \uD83C\uDDFF.
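If you want to verify those surrogate values yourself, the standard UTF-16 arithmetic can be sketched as below. The function name to_surrogates is hypothetical, and it assumes the input is a code point above U+FFFF.

```python
def to_surrogates(code_point):
    # Assumes code_point is in the supplementary range (0x10000..0x10FFFF).
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits -> high (lead) surrogate
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits -> low (trail) surrogate
    return high, low

for cp in (0x1F1E6, 0x1F1FF):
    high, low = to_surrogates(cp)
    print("U+%04X -> \\u%04X \\u%04X" % (cp, high, low))
```

Both ends of the regional indicator range share the same high surrogate, which is why a single literal \uD83C followed by a class over the low surrogates is enough.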
I do not have access to such a Python build myself, but you could try \uD83C[\uDDE6-\uDDFF] as a replacement for a single [\U0001F1E6-\U0001F1FF], so your whole regex would be
(\uD83C[\uDDE6-\uDDFF]\uD83C[\uDDE6-\uDDFF])
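Putting it together, here is a sketch that picks the pattern according to the build, so the same script runs on narrow and wide interpreters. The names RI, FLAG and space_flags are my own, and I have only reasoned through the narrow-build branch rather than run it on such a build.

```python
import re
import sys

if sys.maxunicode > 0xFFFF:
    # Wide build or Python 3.3+: one character class per indicator.
    RI = u"[\U0001F1E6-\U0001F1FF]"
else:
    # Narrow build: fixed high surrogate, class over the low surrogate.
    RI = u"\ud83c[\udde6-\uddff]"

# A flag is two regional indicator symbols in a row.
FLAG = re.compile(u"(" + RI + RI + u")")

def space_flags(text):
    # Append a space after every flag, as in the question's example.
    return FLAG.sub(u"\\1 ", text)

two_flags = u"\U0001F1E7\U0001F1F7\U0001F1E7\U0001F1F7"
print(repr(space_flags(two_flags)))
```

Like the attempt in the question, this leaves a trailing space after the last flag; call .rstrip() on the result if you do not want it.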
The reason the original character class doesn't work is that, on a narrow build, it tries to make a range from the second half of the first surrogate pair to the first half of the second surrogate pair. This fails because the start of the range (\uDDE6) is numerically greater than its end (\uD83C).
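To see what the regex engine is actually handed on a narrow build, you can list the UTF-16 code units of the class contents. A rough sketch (utf16_units is a hypothetical helper name):

```python
import struct

def utf16_units(text):
    # Encode to big-endian UTF-16 and unpack into 16-bit code units.
    data = text.encode("utf-16-be")
    return list(struct.unpack(">%dH" % (len(data) // 2), data))

# The inside of the character class [\U0001F1E6-\U0001F1FF]
units = utf16_units(u"\U0001F1E6-\U0001F1FF")
print([hex(u) for u in units])

# On a narrow build the hyphen sits between units[1] and units[3],
# so the range that gets parsed is units[1]-units[3].
range_start, range_end = units[1], units[3]
print(range_start > range_end)
```

The second print shows that the range start is larger than the range end, which is what triggers "bad character range".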
However, this regular expression would not work on Linux; there you need to use the original pattern, because Linux builds use wide Unicode by default.
Alternatively, upgrade to Python 3.5 or later, where the original pattern works on every platform.