How can I convert surrogate pairs to normal string in Python?

This is a follow-up to Converting to Emoji. In that question, the OP had a json.dumps()-encoded file with an emoji represented as a surrogate pair: \ud83d\ude4f. They were having problems reading the file and translating the emoji correctly. The correct answer was to json.loads() each line from the file, letting the json module handle the conversion from the surrogate pair back to the emoji (which I'm assuming ends up UTF-8-encoded).

So here is my situation: say I have just a regular Python 3 unicode string with a surrogate pair in it:

emoji = "This is \ud83d\ude4f, an emoji."

How do I process this string to get a representation of the emoji out of it? I'm looking to get something like this:

"This is 🙏, an emoji." # or "This is \U0001f64f, an emoji."

I've tried:

print(emoji)
print(emoji.encode("utf-8"))  # also tried "ascii", "utf-16", and "utf-16-le"
json.loads(emoji)  # and `.encode()` with various codecs

Generally I get an error similar to UnicodeEncodeError: XXX codec can't encode character '\ud83d' in position 8: surrogates not allowed.

I'm running Python 3.5.1 on Linux, with $LANG set to en_US.UTF-8. I've run these samples both in the Python interpreter on the command line, and within IPython running in Sublime Text - there don't appear to be any differences.

asked Jul 01 '16 13:07

MattDMo

2 Answers

You've mixed up two different things: the literal text \ud83d in a JSON file on disk (six characters: \ u d 8 3 d) and the single character '\ud83d' in memory (specified using a string literal in Python source code). It is the difference between len(r'\ud83d') == 6 and len('\ud83d') == 1 on Python 3.

If you see the Python string '\ud83d\ude4f' (2 characters), then there is a bug upstream. Normally, you shouldn't get such a string. If you do get one and you can't fix the upstream code that generates it, you can repair it using the surrogatepass error handler:

>>> "\ud83d\ude4f".encode('utf-16', 'surrogatepass').decode('utf-16')
'🙏'
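The one-liner above can be wrapped in a small helper. This is only a sketch: the name merge_surrogates is made up for illustration, and it assumes every surrogate in the input is part of a well-formed high/low pair.

```python
def merge_surrogates(text):
    """Combine UTF-16 surrogate pairs in a str into the code points they stand for.

    Hypothetical helper name, not part of the standard library.
    """
    # 'surrogatepass' lets the UTF-16 encoder write surrogates out as raw code units
    # instead of raising UnicodeEncodeError; decoding those units as UTF-16 then
    # reads each high/low pair back as a single code point.
    return text.encode("utf-16", "surrogatepass").decode("utf-16")


emoji = "This is \ud83d\ude4f, an emoji."
fixed = merge_surrogates(emoji)

print(len(emoji), len(fixed))  # the fixed string is one character shorter
print(ascii(fixed))            # ascii() avoids depending on the terminal encoding
```

Strings without surrogates pass through unchanged. An unpaired surrogate still fails, this time with UnicodeDecodeError at the decode step, so the helper does not hide genuinely broken input.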

Python 2 was more permissive.

Note: even if your JSON file contains the literal text \ud83d\ude4f (12 characters), you shouldn't get the surrogate pair after parsing it:

>>> import json
>>> print(ascii(json.loads(r'"\ud83d\ude4f"')))
'\U0001f64f'

Notice that the result is 1 character ('\U0001f64f'), not the surrogate pair ('\ud83d\ude4f').
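To make the on-disk versus in-memory distinction concrete, here is a small sketch. The variable names are illustrative, and a raw string stands in for a line read from a JSON file.

```python
import json

# What the JSON file holds: backslash escapes spelled out in plain ASCII.
on_disk = r'"This is \ud83d\ude4f, an emoji."'

# The JSON parser merges the escaped pair into one code point.
in_memory = json.loads(on_disk)

print(len(r"\ud83d\ude4f"))  # length of the escape text as stored on disk
print(len("\ud83d\ude4f"))   # length of a bare surrogate pair in a Python str
print(ascii(in_memory))      # ascii() avoids depending on the terminal encoding
```

The surrogate-pair string from the question only appears when the escape text is pasted into Python source as a normal (non-raw) string literal, which bypasses the JSON parser entirely.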

answered Sep 28 '22 02:09

jfs

Because this is a recurring question and the error message is slightly obscure, here is a more detailed explanation.

Surrogates are a way to express Unicode code points bigger than U+FFFF.

Recall that Unicode was originally specified to contain 65,536 characters, but it was soon found that this was not enough to accommodate all the characters of the world's writing systems.

As an extension mechanism for the (otherwise fixed-width) UTF-16 encoding, a reserved area was set aside for expressing code points outside the Basic Multilingual Plane: any code from this special area has to be paired with another code from the same area, and together they express a code point with a number larger than the old limit.

(Strictly speaking, the surrogates area is divided into two halves; the first surrogate in a pair needs to come from the High Surrogates half, and the second, from the Low Surrogates. Confusingly, the High Surrogates U+D800-U+DBFF have lower code point numbers than the Low Surrogates U+DC00-U+DFFF.)

This is a legacy mechanism to support the UTF-16 encoding specifically, and should not be used in other encodings; they do not need it, and the applicable standards specifically say that this is disallowed.

In other words, while U+12345 can be expressed with the surrogate pair U+D808 U+DF45, you should simply express it directly instead unless you are specifically using UTF-16.
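The arithmetic behind that pairing can be sketched as follows. The function names are made up for illustration; the constants are the High and Low Surrogates ranges mentioned above.

```python
def to_surrogates(code_point):
    """Split a code point above U+FFFF into a (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points U+10000..U+10FFFF need surrogates")
    offset = code_point - 0x10000        # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits go into the low surrogate
    return high, low


def from_surrogates(high, low):
    """Recombine a (high, low) surrogate pair into the original code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogates(0x12345)
print("U+{:04X} U+{:04X}".format(high, low))              # U+D808 U+DF45
print("U+{:04X}".format(from_surrogates(0xD83D, 0xDE4F)))  # U+1F64F
```

The second call uses the pair from the question and yields U+1F64F, the code point of the 🙏 emoji.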

In some more detail, here is how this would be expressed in UTF-8 as a single character:

0xF0 0x92 0x8D 0x85

And here is the corresponding surrogate sequence, with each surrogate encoded separately (which is not valid UTF-8):

0xED 0xA0 0x88 0xED 0xBD 0x85
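Both byte sequences can be reproduced in Python. This is a sketch with an illustrative helper named show; it uses the surrogatepass error handler only to force out the surrogate bytes that the default strict handler refuses to produce.

```python
def show(data):
    """Format bytes the way they are written above, e.g. '0xF0 0x92'."""
    return " ".join("0x{:02X}".format(b) for b in data)


direct = "\U00012345".encode("utf-8")
# Plain .encode("utf-8") would raise UnicodeEncodeError here; 'surrogatepass'
# forces each surrogate to be written as its own three-byte sequence.
pair_bytes = "\ud808\udf45".encode("utf-8", "surrogatepass")

print(show(direct))      # 0xF0 0x92 0x8D 0x85
print(show(pair_bytes))  # 0xED 0xA0 0x88 0xED 0xBD 0x85

try:
    pair_bytes.decode("utf-8")
except UnicodeDecodeError as exc:
    print("strict UTF-8 decoder rejects it:", exc.reason)
```

The six-byte form is rejected by the default UTF-8 decoder, which is why it should never be written to a file or sent over the wire.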

As already suggested in the accepted answer, you can round-trip with something like

>>> "\ud808\udf45".encode('utf-16', 'surrogatepass').decode('utf-16').encode('utf-8')
b'\xf0\x92\x8d\x85'

Perhaps see also http://www.russellcottrell.com/greek/utilities/surrogatepaircalculator.htm

answered Sep 28 '22 01:09

tripleee

Tags: python, python-3.x, unicode, surrogate-pairs
