I came across this website, which shows the Unicode table.
When I print the word 'ספר':
>>> x = 'ספר'
>>> x
'\xd7\xa1\xd7\xa4\xd7\xa8'
I get these characters: '\xd7\xa1\xd7\xa4\xd7\xa8'.
I think Python encodes the word 'ספר' as UTF-8, because that's the default, right?
But when I run this code:
>>> x = u'ספר'
>>> x
u'\u05e1\u05e4\u05e8'
I get u'\u05e1\u05e4\u05e8', which is a sequence of Unicode code points, right?
How do I convert from the UTF-8 literal to Unicode code points?
In the first sample you created a byte string (type str). Your terminal determined the encoding (UTF-8 in this case).
In your second sample, you created a Unicode string (type unicode). Python auto-detected the encoding your terminal uses (from sys.stdin.encoding) and decoded the bytes from UTF-8 to Unicode code points.
You can make the same conversion from byte string to Unicode string by decoding:
unicode_x = bytestring_x.decode('utf8')
To go the other direction, you need to encode:
bytestring_x = unicode_x.encode('utf8')
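As a minimal sketch of that round trip, here are the exact bytes and code points from the question. It uses explicit b'' and u'' literals with escapes, so it should behave the same on Python 2.7 and Python 3 (where the byte string type is called bytes and the Unicode type is str). The name code_points is introduced only for illustration.

```python
# The six UTF-8 bytes from the first sample, spelled with \xhh escapes.
bytestring_x = b'\xd7\xa1\xd7\xa4\xd7\xa8'

# Decoding interprets the bytes as UTF-8 and yields Unicode code points.
unicode_x = bytestring_x.decode('utf8')
assert unicode_x == u'\u05e1\u05e4\u05e8'

# Encoding goes the other way and reproduces the original bytes.
assert unicode_x.encode('utf8') == bytestring_x

# Three characters, but six bytes: each letter takes two bytes in UTF-8.
assert len(unicode_x) == 3
assert len(bytestring_x) == 6

# List the code points in the usual U+XXXX notation.
code_points = ['U+%04X' % ord(ch) for ch in unicode_x]
print(code_points)
```

The byte string and the Unicode string hold the same text. Only the representation differs, which is why the two lengths are not equal.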
You specified your literals by using the actual UTF-8 bytes for the characters; this works fine in a terminal but not in Python source code, because by default Python 2 reads source files as ASCII only. You can change this by adding a source code encoding declaration. This is specified in PEP 263; the declaration has to be on the first or second line of your source file. For example:
# encoding: UTF-8
or you can stick to \uhhhh and \xhh escape sequences to represent non-ASCII characters.
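To illustrate both options side by side, here is a small sketch of a source file, assuming it is saved as UTF-8. The names literal, escaped and encoded are made up for this example.

```python
# encoding: UTF-8
# The declaration above (PEP 263) tells Python 2 that this file is UTF-8.
# Python 3 already assumes UTF-8, so the line is harmless there.

# Option 1: type the characters directly (needs the declaration on Python 2).
literal = u'ספר'

# Option 2: keep the source pure ASCII by using \uhhhh escapes.
escaped = u'\u05e1\u05e4\u05e8'

# Both spell the same three code points.
assert literal == escaped

# In a byte string, \xhh escapes spell out the UTF-8 encoded form.
encoded = b'\xd7\xa1\xd7\xa4\xd7\xa8'
assert literal.encode('utf-8') == encoded
```

The escaped form is more robust when a file may pass through editors or tools that do not preserve its encoding. The literal form is easier to read.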
You probably want to read up about the difference between Unicode and encoded (binary) byte strings, and how that relates to Python:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
The Python Unicode HOWTO
Pragmatic Unicode by Ned Batchelder