 

Byte string vs. Unicode string in Python

Could you explain in detail what the difference is between byte string and Unicode string in Python. I have read this:

Byte code is simply the converted source code into arrays of bytes

Does it mean that Python has its own coding/encoding format? Or does it use the operating system settings? I don't understand. Could you please explain? Thank you!

ashim asked Apr 08 '12


2 Answers

No, Python does not use its own encoding. It will use any encoding that it has access to and that you specify. A character in a str represents one Unicode character. However, to represent more than 256 characters, individual Unicode encodings use more than one byte per character. bytes objects give you access to the underlying bytes. str objects have an encode method that takes a string naming an encoding and returns the bytes object that represents the string in that encoding. bytes objects have a decode method that takes a string naming an encoding and returns the str that results from interpreting the bytes as text in that encoding. Here's an example.

>>> a = "αά".encode('utf-8')
>>> a
b'\xce\xb1\xce\xac'
>>> a.decode('utf-8')
'αά'

We can see that UTF-8 uses four bytes, \xce, \xb1, \xce, and \xac, to represent two characters. After the Spolsky article that Ignacio Vazquez-Abrams referred to, I would read the Python Unicode HOWTO.
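To make the character/byte distinction concrete, here is a small follow-up sketch in the same interpreter session: len() counts characters on the str but bytes on the encoded object, and indexing the bytes object gives you the raw integer value of each byte.

>>> s = "αά"
>>> len(s)
2
>>> b = s.encode('utf-8')
>>> len(b)
4
>>> b[0]
206
>>> hex(b[0])
'0xce'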

aaronasterling answered Oct 09 '22


Here's an attempt at a simple explanation that only applies to Python 3. I hope that, coming from a layperson, it helps clear up some confusion for the completely uninitiated. If there are any technical inaccuracies, please forgive me and feel free to point them out.

Suppose you create a string using Python 3 in the usual way:

stringobject = 'ant' 

stringobject would be a unicode string.

A unicode string is made up of unicode characters. In stringobject above, the unicode characters are the individual letters, e.g. a, n, t

Each unicode character is assigned a code point, which can be expressed as a sequence of hex digits (a hex digit can take on 16 values, ranging from 0-9 and A-F). For instance, the letter 'a' is equivalent to '\u0061', and 'ant' is equivalent to '\u0061\u006E\u0074'.

So you will find that if you type in,

stringobject = '\u0061\u006E\u0074'
stringobject

You will also get the output 'ant'.
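If you want to check those code points yourself, the built-in ord() returns a character's code point as an integer and hex() shows it in hexadecimal; a quick sketch:

ord('a')                         # 97
hex(ord('a'))                    # '0x61'
[hex(ord(c)) for c in 'ant']     # ['0x61', '0x6e', '0x74']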

Now, unicode is converted to bytes, in a process known as encoding. The reverse process of converting bytes to unicode is known as decoding.

How is this done? Since each hex digit can take on 16 different values, it can be reflected in a 4-bit binary sequence (e.g. the hex digit 0 can be expressed in binary as 0000, the hex digit 1 can be expressed as 0001 and so forth). If a unicode character has a code point consisting of four hex digits, it would need a 16-bit binary sequence to encode it.
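As a rough illustration, the built-in format() can print a code point as four hex digits or as the corresponding 16-bit binary sequence; a small sketch:

format(ord('a'), '04X')      # '0061'  (four hex digits)
format(ord('a'), '016b')     # '0000000001100001'  (the same value as 16 bits)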

Different encoding systems specify different rules for converting unicode to bits. Most importantly, encodings differ in the number of bits they use to express each unicode character.

For instance, the ASCII encoding system uses 7 bits per character (stored in one byte). Thus it can only encode the first 128 unicode characters, i.e. those with code points up to two hex digits long. The UTF-8 encoding system uses 8 to 32 bits (1 to 4 bytes) per character, so it can encode every unicode character, all the way up to code point U+10FFFF.
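To see this variable width in practice, here is a small sketch using a few sample characters from different ranges, each encoded with UTF-8:

len('a'.encode('utf-8'))     # 1 byte  (code point U+0061)
len('α'.encode('utf-8'))     # 2 bytes (code point U+03B1)
len('€'.encode('utf-8'))     # 3 bytes (code point U+20AC)
len('🐍'.encode('utf-8'))    # 4 bytes (code point U+1F40D)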

Running the following code:

byteobject = stringobject.encode('utf-8')
byteobject, type(byteobject)

converts a unicode string into a byte string using the UTF-8 encoding system, and returns (b'ant', <class 'bytes'>).

Note that if you used 'ascii' as the encoding system, you wouldn't run into any problems, since all code points in 'ant' are within the ASCII range. But if you had a unicode string containing characters with code points outside the ASCII range, you would get a UnicodeEncodeError.
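For example, a quick sketch of both cases:

'ant'.encode('ascii')     # works: b'ant', every code point is in the ASCII range
'αnt'.encode('ascii')     # raises UnicodeEncodeError: 'ascii' codec can't encode character '\u03b1'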

Similarly,

stringobject = byteobject.decode('utf-8')
stringobject, type(stringobject)

gives you ('ant', <class 'str'>).

runawaykid answered Oct 09 '22