What is the difference between a UTF-8 literal and a Unicode code point?

Question

I came across this website, which shows the Unicode table.

When I print the word 'ספר':

>>> x = 'ספר'
>>> x
'\xd7\xa1\xd7\xa4\xd7\xa8'

I get these characters: '\xd7\xa1\xd7\xa4\xd7\xa8'.

I think that Python encodes the word 'ספר' as UTF-8 because it's the default, right?

But when I run this code:

>>> x = u'ספר'
>>> x
u'\u05e1\u05e4\u05e8'

I get u'\u05e1\u05e4\u05e8', which is a sequence of Unicode code points, right?

How do I convert from a UTF-8 literal to Unicode code points?

Martijn Pieters · Accepted Answer

In the first sample you created a byte string (type str). Your terminal determined the encoding (UTF-8 in this case).

In your second sample, you created a Unicode string (type unicode). Python auto-detected the encoding your terminal uses (from sys.stdin.encoding) and decoded the bytes from UTF-8 to Unicode code points.
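To see how one code point relates to its UTF-8 bytes, here is a minimal sketch of the two-byte UTF-8 layout (110xxxxx 10xxxxxx), which covers U+0080 through U+07FF and therefore the Hebrew letters in the question. The helper name utf8_two_byte is hypothetical and exists only for illustration; it is not part of Python.

```python
def utf8_two_byte(code_point):
    """Encode a code point in U+0080..U+07FF as two UTF-8 bytes (illustrative helper)."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("only the two-byte UTF-8 range is handled here")
    lead = 0xC0 | (code_point >> 6)     # 110xxxxx: the top 5 bits of the code point
    trail = 0x80 | (code_point & 0x3F)  # 10xxxxxx: the low 6 bits of the code point
    return bytearray([lead, trail])

# U+05E1 is the first letter of the word in the question.
samekh_bytes = utf8_two_byte(0x05E1)
assert samekh_bytes == bytearray(b'\xd7\xa1')
```

The code point U+05E1 is a number identifying the character; the pair \xd7\xa1 is one particular way of storing that number as bytes.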

You can make the same conversion from byte string to Unicode string by decoding:

unicode_x = bytestring_x.decode('utf8')

To go the other direction, you need to encode:

bytestring_x = unicode_x.encode('utf8')
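As a rough sketch, the round trip for the word from the question looks like this. The variable names are only illustrative, and the literals are written with escapes and explicit b/u prefixes so the snippet does not depend on the terminal or source encoding.

```python
byte_string = b'\xd7\xa1\xd7\xa4\xd7\xa8'     # six UTF-8 bytes, as echoed in the question
unicode_string = byte_string.decode('utf-8')  # bytes -> Unicode code points

assert unicode_string == u'\u05e1\u05e4\u05e8'
assert len(byte_string) == 6       # two bytes per Hebrew letter
assert len(unicode_string) == 3    # one code point per Hebrew letter

# Encoding goes back the other way and yields the original bytes.
assert unicode_string.encode('utf-8') == byte_string

# The numeric value of each code point.
code_points = [hex(ord(ch)) for ch in unicode_string]
assert code_points == ['0x5e1', '0x5e4', '0x5e8']
```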

You specified your literals using the actual UTF-8 bytes for the characters. This works fine in a terminal, but not in Python source code, because Python 2 reads source files as ASCII by default. You can change this by adding a source code encoding declaration, as specified in PEP 263; it has to be on the first or second line of your source file. For example:

# encoding: UTF-8

Alternatively, you can stick to \uhhhh and \xhh escape sequences to represent non-ASCII characters.
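A small sketch of the escape-only approach: the source below contains nothing but ASCII, so it needs no encoding declaration. One thing to keep in mind is that \xhh means a raw byte in a byte string but the code point U+00hh in a Unicode string. The variable names are only illustrative.

```python
import unicodedata

word = u'\u05e1\u05e4\u05e8'  # code points written as \uhhhh escapes

# Look up the official Unicode name of each character.
names = [unicodedata.name(ch) for ch in word]
assert names == ['HEBREW LETTER SAMEKH', 'HEBREW LETTER PE', 'HEBREW LETTER RESH']

# In a Unicode literal, \xhh is a code point below U+0100, not a byte.
assert u'\xe9' == u'\u00e9'
assert len(u'\xd7\xa1') == 2  # two separate code points, U+00D7 and U+00A1

# In a byte literal, the same escapes are two bytes that together encode U+05E1.
assert b'\xd7\xa1'.decode('utf-8') == u'\u05e1'
```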

You probably want to read up about the difference between Unicode and encoded (binary) byte strings, and how that relates to Python:

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
The Python Unicode HOWTO
Pragmatic Unicode by Ned Batchelder

Tags:

python

unicode

utf-8

python-2.7

david
