Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Converting byte string in unicode string

I have a code such that:

a = "\u0432" b = u"\u0432" c = b"\u0432" d = c.decode('utf8')  print(type(a), a) print(type(b), b) print(type(c), c) print(type(d), d) 

And output:

<class 'str'> в <class 'str'> в <class 'bytes'> b'\\u0432' <class 'str'> \u0432 

Why in the latter case I see a character code, instead of the character? How I can transform Byte string to Unicode string that in case of an output I saw the character, instead of its code?

like image 880
Alex T Avatar asked Dec 12 '12 10:12

Alex T


People also ask

Can you convert byte into string?

Given a Byte value in Java, the task is to convert this byte value to string type. One method is to create a string variable and then append the byte value to the string variable with the help of + operator. This will directly convert the byte value to a string and add it in the string variable.

What is Unicode string and byte string?

A character in a str represents one Unicode character. However, to represent more than 256 characters, individual Unicode encodings use more than one byte per character to represent many characters. bytes objects give you access to the underlying bytes.

How do you Unicode a string in Python?

To allow working with Unicode characters, Python 2 has a unicode type which is a collection of Unicode code points (like Python 3's str type). The line ustring = u'A unicode \u018e string \xf1' creates a Unicode string with 20 characters.


1 Answers

In strings (or Unicode objects in Python 2), \u has a special meaning, namely saying, "here comes a Unicode character specified by it's Unicode ID". Hence u"\u0432" will result in the character в.

The b'' prefix tells you this is a sequence of 8-bit bytes, and bytes object has no Unicode characters, so the \u code has no special meaning. Hence, b"\u0432" is just the sequence of the bytes \,u,0,4,3 and 2.

Essentially you have an 8-bit string containing not a Unicode character, but the specification of a Unicode character.

You can convert this specification using the unicode escape encoder.

>>> c.decode('unicode_escape') 'в' 
like image 104
Lennart Regebro Avatar answered Oct 11 '22 16:10

Lennart Regebro