Python 3.2 (r32:88445, Feb 20 2011, 21:29:02) [MSC v.1500 32 bit (Intel)] on win32
Type "copyright", "credits" or "license()" for more information.
>>> str_version = 'នយោបាយ'
>>> type(str_version)
<class 'str'>
>>> print (str_version)
នយោបាយ
>>> unicode_version = 'នយោបាយ'.decode('utf-8')
Traceback (most recent call last):
  File "<pyshell#3>", line 1, in <module>
    unicode_version = 'នយោបាយ'.decode('utf-8')
AttributeError: 'str' object has no attribute 'decode'
>>>
What is the problem with my Unicode string?
To allow working with Unicode characters, Python 2 has a unicode type which is a collection of Unicode code points (like Python 3's str type). The line ustring = u'A unicode \u018e string \xf1' creates a Unicode string with 20 characters.
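As a quick check of that character count, here is a minimal sketch for Python 3, where the plain str literal plays the role of Python 2's u'...' literal. The names char_count and byte_count are only illustrative.

```python
# Python 3: a plain str literal is already a sequence of Unicode code points.
ustring = 'A unicode \u018e string \xf1'

# len() counts code points, not bytes.
char_count = len(ustring)

# The two non-ASCII characters each take two bytes in UTF-8,
# so the encoded form is longer than the string itself.
byte_count = len(ustring.encode('utf-8'))

print(char_count, byte_count)
```

The difference between the two numbers is why "length in characters" and "length in bytes" should not be used interchangeably.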
Python's string type uses the Unicode Standard for representing characters, which lets Python programs work with all these different possible characters. Unicode (https://www.unicode.org/) is a specification that aims to list every character used by human languages and give each character its own unique code.
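To make the idea of "one unique code per character" concrete, the sketch below lists the code points of the string from the question using the built-in ord() and chr(). The variable names are illustrative, not part of the original post.

```python
text = 'នយោបាយ'

# ord() maps each character to its Unicode code point (an integer).
code_points = [ord(ch) for ch in text]

# Conventional U+XXXX notation for each code point.
labels = ['U+{:04X}'.format(cp) for cp in code_points]

# chr() is the inverse of ord(), so the string can be rebuilt.
rebuilt = ''.join(chr(cp) for cp in code_points)

print(labels)
print(rebuilt == text)
```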
Inserting Unicode characters: To insert a Unicode character, type the character code, press ALT, and then press X. For example, to type a dollar symbol ($), type 0024, press ALT, and then press X. For more Unicode character codes, see Unicode character code charts by script.
There is nothing wrong with your string! You have just confused encode() and decode(). The string is already meaningful symbols. To turn it into bytes that can be stored in a file or transmitted over the Internet, use encode() with an encoding like UTF-8. Each encoding is a scheme for converting meaningful symbols to flat bytes of output.
When the time comes to do the opposite (to take some raw bytes from a file or a socket and turn them into symbols like letters and numbers), you decode the bytes using the decode() method of bytestrings in Python 3.
>>> str_version = 'នយោបាយ'
>>> str_version.encode('utf-8')
b'\xe1\x9e\x93\xe1\x9e\x99\xe1\x9f\x84\xe1\x9e\x94\xe1\x9e\xb6\xe1\x9e\x99'
See that big long line of bytes? Those are the bytes that UTF-8 uses to represent your string when you need to transmit it over a network or store it in a document. There are many other encodings in use, but UTF-8 seems to be the most popular. Each encoding can turn meaningful symbols like ន and យោ into bytes, the little 8-bit numbers with which computers communicate.
>>> rawbytes = str_version.encode('utf-8')
>>> rawbytes
b'\xe1\x9e\x93\xe1\x9e\x99\xe1\x9f\x84\xe1\x9e\x94\xe1\x9e\xb6\xe1\x9e\x99'
>>> rawbytes.decode('utf-8')
'នយោបាយ'
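The same round trip applies when the bytes actually leave your program. The following is a rough sketch, assuming a temporary file stands in for the "document" or network connection; the file handling is illustrative and not part of the original answer.

```python
import os
import tempfile

text = 'នយោបាយ'

# Encode: str -> bytes, the form that can be written to a binary file.
raw = text.encode('utf-8')

fd, path = tempfile.mkstemp()
try:
    # Write the raw bytes out...
    with os.fdopen(fd, 'wb') as f:
        f.write(raw)
    # ...then read them back and decode: bytes -> str.
    with open(path, 'rb') as f:
        restored = f.read().decode('utf-8')
finally:
    os.remove(path)

print(restored == text)
```

In everyday code, open(path, 'w', encoding='utf-8') does the encoding step for you; the binary mode is used here only to make both steps visible.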
You're reading the 2.x docs. str.decode() (and bytes.encode()) were dropped in 3.x, and str is already a Unicode string; there's no need to decode it.
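If your code may receive either bytes or str, one possible approach is to decode only when needed. The sketch below assumes Python 3; to_text is a hypothetical helper name, not a standard library function.

```python
def to_text(value, encoding='utf-8'):
    """Return value as str, decoding only when it is bytes.

    Hypothetical helper for illustration; not part of the standard library.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# In Python 3, each type has only the method that makes sense for it.
str_has_decode = hasattr('abc', 'decode')      # str is already text
bytes_has_encode = hasattr(b'abc', 'encode')   # bytes is already raw data

from_str = to_text('នយោបាយ')
from_bytes = to_text('នយោបាយ'.encode('utf-8'))

print(str_has_decode, bytes_has_encode, from_str == from_bytes)
```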