
Escaping Unicode strings in Python

In Python 2, these three statements print the same emoji:

print "\xF0\x9F\x8C\x80"
🌀
print u"\U0001F300"
🌀
print u"\ud83c\udf00"
🌀

How can I translate between \x, \u and \U escapes? I can't figure out how these hex numbers are equivalent.

Jose G asked Mar 24 '15

3 Answers

The first one is a byte string:

>>> "\xF0\x9F\x8C\x80".decode('utf8')
u'\U0001f300'

The u"\ud83c\udf00" one is the UTF-16 version: a surrogate pair, written with four-digit \u escapes.

The u"\U0001F300" one is the actual index of the codepoint, written with an eight-digit \U escape.


But how do the numbers relate? This is the difficult question. It's defined by the encoding and there is no obvious relationship. To give you an idea, here is an example of "manually" encoding the codepoint at index 0x1F300 into UTF-8:

The cyclone character 🌀 has index 0x1f300, which falls into the range 0x00010000 - 0x001FFFFF. The template for this range is:

11110... 10...... 10...... 10......

where you fill in the dots with the binary representation of the codepoint. There is no deeper reason the template looks like that; it's simply how UTF-8 is defined.

Here's the binary representation of our codepoint:

>>> u'🌀'
u'\U0001f300'
>>> unichr(0x1f300)
u'\U0001f300'
>>> bin(0x1f300)
'0b11111001100000000'

So if we take the string template and fill it up like this (with some leading zeros because there are more slots in the template than significant digits in our number) we get this:

11110... 10...... 10...... 10......
11110000 10011111 10001100 10000000

Now let's convert that back to hex:

>>> 0b11110000100111111000110010000000
4036988032
>>> hex(4036988032)
'0xf09f8c80'

And there you have the UTF-8 representation of the codepoint.
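The manual walkthrough above can be sketched as a function. This is a Python 3 version (the answer's session used Python 2, where unichr plays the role of chr), handling only the 4-byte template for illustration:

```python
# A sketch of the manual UTF-8 encoding above, Python 3 syntax.
# Handles only codepoints in 0x10000 - 0x1FFFFF (the 4-byte template).
def encode_utf8_4byte(codepoint):
    """Fill the 11110... 10...... 10...... 10...... template by hand."""
    assert 0x10000 <= codepoint <= 0x1FFFFF
    bits = format(codepoint, '021b')        # 21 payload bits, zero-padded
    return bytes([
        0b11110000 | int(bits[:3], 2),      # 11110... takes 3 payload bits
        0b10000000 | int(bits[3:9], 2),     # 10...... takes 6 payload bits
        0b10000000 | int(bits[9:15], 2),    # 10......
        0b10000000 | int(bits[15:], 2),     # 10......
    ])

print(encode_utf8_4byte(0x1F300).hex())      # f09f8c80
print(chr(0x1F300).encode('utf-8').hex())    # f09f8c80, built-in codec agrees
```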

For UTF16 there is a similar magic recipe for your codepoint: 0x10000 is subtracted from the index, and then we pad with zeros to get a 20-bit binary representation. The first ten bits are added to 0xD800 to give the first 16-bit code unit. The last ten bits are added to 0xDC00 to give the second 16-bit code unit.

>>> bin(0x1f300 - 0x10000)[2:].rjust(20, '0')
'00001111001100000000'
>>> _[:10], _[10:]
('0000111100', '1100000000')
>>> hex(0b0000111100 + 0xd800)
'0xd83c'
>>> hex(0b1100000000 + 0xdc00)
'0xdf00'

And there's your UTF-16 version, i.e. the one with the lowercase \u escapes.
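The same surrogate-pair recipe can be written as a small function; this is a Python 3 sketch using bit shifts instead of string slicing:

```python
# The surrogate-pair recipe above as arithmetic (Python 3).
def to_surrogate_pair(codepoint):
    """Split a codepoint above 0xFFFF into its two UTF-16 code units."""
    assert codepoint > 0xFFFF
    offset = codepoint - 0x10000           # a 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits -> low surrogate
    return high, low

print([hex(unit) for unit in to_surrogate_pair(0x1F300)])   # ['0xd83c', '0xdf00']
```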

As you can see, there is no obvious numerical relationship between the hex digits in these representations; they are just different encodings of the same codepoint.

wim answered Sep 22 '22


Your first string is a byte string. The fact that it prints a single emoji character means that your console is configured to print UTF-8 encoded characters.

Your second string is a Unicode string with a single codepoint, U+1F300. The \U specifies that the next 8 hex digits should be interpreted as a codepoint.

The third string takes advantage of a quirk in the way Unicode strings are stored in Python 2. You've given two UTF-16 code units (a surrogate pair), which together form the single codepoint U+1F300, the same as the previous string. Each \u takes 4 following hex digits. Individually these code units aren't valid Unicode characters, but because a narrow build of Python 2 stores its Unicode internally as UTF-16, it works out. In Python 3 the literal still parses, but the lone surrogates won't combine into one character and can't be printed or encoded.
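The pair can still be recombined by hand, reversing the arithmetic from the accepted answer. A minimal Python 3 sketch:

```python
# Recombining a surrogate pair arithmetically (Python 3, where lone
# \u surrogates no longer join automatically).
high, low = 0xD83C, 0xDF00
codepoint = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
print(hex(codepoint))    # 0x1f300
print(chr(codepoint))    # the cyclone emoji
```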

When you print out a Unicode string, and your console encoding is known to be UTF-8, the Unicode strings are encoded to UTF-8 bytes. Thus the 3 strings end up producing the same byte sequence on the output, generating the same character.
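The same round trip between the first two forms works in Python 3 as well; a small sketch (note the byte-string literal needs an explicit b prefix there):

```python
# Python 3 sketch: the byte string and the \U escape describe the
# same character, related through the UTF-8 codec.
raw = b"\xF0\x9F\x8C\x80"     # UTF-8 bytes (b prefix required in Python 3)
text = "\U0001F300"           # one codepoint
assert raw.decode('utf-8') == text
assert text.encode('utf-8') == raw
print(text)                   # the cyclone emoji
```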

Mark Ransom answered Sep 21 '22


See Unicode Literals in Python Source Code

In Python source code, Unicode literals are written as strings prefixed with the 'u' or 'U' character: u'abcdefghijk'. Specific code points can be written using the \u escape sequence, which is followed by four hex digits giving the code point. The \U escape sequence is similar, but expects 8 hex digits, not 4.

In [1]: "\xF0\x9F\x8C\x80".decode('utf-8')
Out[1]: u'\U0001f300'

In [2]: u'\U0001F300'.encode('utf-8')
Out[2]: '\xf0\x9f\x8c\x80'

In [3]: u'\ud83c\udf00'.encode('utf-8')
Out[3]: '\xf0\x9f\x8c\x80'

\uhhhh     --> Unicode character with 16-bit hex value  
\Uhhhhhhhh --> Unicode character with 32-bit hex value

In Unicode escapes, the first form gives four hex digits to encode a 2-byte (16-bit) code point, and the second gives eight hex digits for a 4-byte (32-bit) code point. Byte strings support only hex escapes for encoded text and other forms of byte-based data.
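A short Python 3 illustration of the two escape widths, plus the \N{...} named escape (not mentioned above, added here for comparison):

```python
# Python 3 demo of the escape forms; \N{...} looks a character up by name.
assert "\u00e9" == "é"                  # \u: exactly 4 hex digits
assert "\U0001F300" == chr(0x1F300)     # \U: exactly 8 hex digits
assert "\N{CYCLONE}" == "\U0001F300"    # named escape for the same codepoint
print("all escape forms agree")
```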

Aaron answered Sep 24 '22