
What is the difference between a UTF-8 literal and a Unicode code point?

I came across this website, which shows the Unicode table.

When I print the word 'ספר':

>>> x = 'ספר'
>>> x
'\xd7\xa1\xd7\xa4\xd7\xa8'

I get these characters: '\xd7\xa1\xd7\xa4\xd7\xa8'.

I think Python encodes the word 'ספר' as UTF-8 because that's the default, right?

But when I run this code:

>>> x = u'ספר'
>>> x
u'\u05e1\u05e4\u05e8'

I get u'\u05e1\u05e4\u05e8', which is a sequence of Unicode code points, right?

How do I convert from a UTF-8 literal to Unicode code points?

david asked Nov 27 '14 09:11


1 Answer

In the first sample you created a byte string (type str). Your terminal determined the encoding (UTF-8 in this case).

In your second sample, you created a Unicode string (type unicode). Python auto-detected the encoding your terminal uses (from sys.stdin.encoding) and decoded the bytes from UTF-8 to Unicode code points.
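To make the two representations concrete, here is a minimal sketch that puts the values from both samples side by side. It uses only escape sequences so it does not depend on a terminal or source encoding, and it is written to run unchanged on Python 2.7 and Python 3.3+ (where the byte string type is called bytes and the Unicode type is called str). The variable names are illustrative, not from the question.

```python
# The six UTF-8 bytes the first sample displayed (a byte string).
byte_string = b'\xd7\xa1\xd7\xa4\xd7\xa8'

# The three code points the second sample displayed (a Unicode string).
text = u'\u05e1\u05e4\u05e8'

# A byte string counts bytes; a Unicode string counts code points.
print(len(byte_string))  # 6
print(len(text))         # 3

# Each character maps to one code point, shown here in U+XXXX notation.
code_points = ['U+%04X' % ord(ch) for ch in text]
print(code_points)       # ['U+05E1', 'U+05E4', 'U+05E8']
```

Each of the three Hebrew letters takes two bytes in UTF-8, which is why the byte string is twice as long as the Unicode string.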

You can make the same conversion from byte string to Unicode string by decoding:

unicode_x = bytestring_x.decode('utf8')

To go the other direction, you need to encode:

bytestring_x = unicode_x.encode('utf8')
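The two conversions above can be checked as a round trip. This is a small sketch using the byte values from the question; it also decodes the same bytes with a different codec (Latin-1, chosen here purely for illustration) to show that the codec name has to match how the bytes were produced. It should run on both Python 2.7 and Python 3.3+.

```python
bytestring_x = b'\xd7\xa1\xd7\xa4\xd7\xa8'

# Byte string -> Unicode string: interpret the bytes as UTF-8.
unicode_x = bytestring_x.decode('utf8')
print(unicode_x == u'\u05e1\u05e4\u05e8')   # True

# Unicode string -> byte string: serialise the code points as UTF-8.
round_tripped = unicode_x.encode('utf8')
print(round_tripped == bytestring_x)        # True

# Decoding with the wrong codec raises no error here, but every byte
# becomes its own character, so the result is six characters of
# unrelated text rather than the three Hebrew letters.
wrong = bytestring_x.decode('latin-1')
print(len(wrong))                           # 6
```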

You specified your literals using the actual UTF-8 bytes for the characters. This works in a terminal but not in Python source code, because Python 2 reads source files as ASCII by default. You can change this by adding a source code encoding declaration, as specified in PEP 263; it has to be on the first or second line of your source file. For example:

# encoding: UTF-8

or you can stick to \uhhhh and \xhh escape sequences to represent non-ASCII characters.
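As a rough illustration of the escape-sequence route, the sketch below keeps the source file pure ASCII and uses the standard unicode_escape codec to produce and parse the \uhhhh form. The variable names are made up for the example, and it is meant to run on both Python 2.7 and Python 3.3+.

```python
# ASCII-only source: \uhhhh names one code point, \xhh names one byte.
text = u'\u05e1\u05e4\u05e8'
utf8_bytes = b'\xd7\xa1\xd7\xa4\xd7\xa8'

# The unicode_escape codec turns a Unicode string into its ASCII
# escape form, which can be pasted into a source file safely.
escaped = text.encode('unicode_escape')
print(escaped == b'\\u05e1\\u05e4\\u05e8')  # True

# Decoding with the same codec parses the escapes back into text.
restored = escaped.decode('unicode_escape')
print(restored == text)                     # True
```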

You probably want to read up about the difference between Unicode and encoded (binary) byte strings, and how that relates to Python:

  • The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky

  • The Python Unicode HOWTO

  • Pragmatic Unicode by Ned Batchelder

Martijn Pieters answered Sep 30 '22 23:09