Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

String In python with my unicode?

Python 3.2 (r32:88445, Feb 20 2011, 21:29:02) [MSC v.1500 32 bit (Intel)] on win32
Type "copyright", "credits" or "license()" for more information.
>>> str_version = 'នយោបាយ'
>>> type(str_version)
<class 'str'>
>>> print (str_version)
នយោបាយ
>>> unicode_version = 'នយោបាយ'.decode('utf-8')
Traceback (most recent call last):
  File "<pyshell#3>", line 1, in <module>
    unicode_version = 'នយោបាយ'.decode('utf-8')
AttributeError: 'str' object has no attribute 'decode'
>>> 

What the problem with my unicode string?

like image 837
kn3l Avatar asked Mar 26 '11 20:03

kn3l


People also ask

How do you Unicode a string in Python?

To allow working with Unicode characters, Python 2 has a unicode type which is a collection of Unicode code points (like Python 3's str type). The line ustring = u'A unicode \u018e string \xf1' creates a Unicode string with 20 characters.

Can Python handle Unicode?

Python's string type uses the Unicode Standard for representing characters, which lets Python programs work with all these different possible characters. Unicode (https://www.unicode.org/) is a specification that aims to list every character used by human languages and give each character its own unique code.

How do you add a Unicode to a string?

Inserting Unicode characters To insert a Unicode character, type the character code, press ALT, and then press X. For example, to type a dollar symbol ($), type 0024, press ALT, and then press X. For more Unicode character codes, see Unicode character code charts by script.


2 Answers

There is nothing wrong with your string! You just have confused encode() and decode(). The string is meaningful symbols. To turn it into bytes that could be stored in a file or transmitted over the Internet, use encode() with an encoding like UTF-8. Each encoding is a scheme for converting meaningful symbols to flat bytes of output.

When the time comes to do the opposite — to take some raw bytes from a file or a socket and turn them into symbols like letters and numbers — you will decode the bytes using the decode() method of bytestrings in Python 3.

>>> str_version = 'នយោបាយ'
>>> str_version.encode('utf-8')
b'\xe1\x9e\x93\xe1\x9e\x99\xe1\x9f\x84\xe1\x9e\x94\xe1\x9e\xb6\xe1\x9e\x99'

See that big long line of bytes? Those are the bytes that UTF-8 uses to represent your string, if you need to transmit the string over a network, or store them in a document. There are many other encodings in use, but it seems to be the most popular. Each encoding can turn meaningful symbols like ន and យោ into bytes — the little 8-bit numbers with which computers communicate.

>>> rawbytes = str_version.encode('utf-8')
>>> rawbytes
b'\xe1\x9e\x93\xe1\x9e\x99\xe1\x9f\x84\xe1\x9e\x94\xe1\x9e\xb6\xe1\x9e\x99'
>>> rawbytes.decode('utf-8')
'នយោបាយ'
like image 67
Brandon Rhodes Avatar answered Sep 21 '22 13:09

Brandon Rhodes


You're reading the 2.x docs. str.decode() (and bytes.encode()) was dropped in 3.x. And str is already a Unicode string; there's no need to decode it.

like image 45
Ignacio Vazquez-Abrams Avatar answered Sep 20 '22 13:09

Ignacio Vazquez-Abrams