What exactly is a unicode string?
What's the difference between a regular string and unicode string?
What is utf-8?
I'm trying to learn Python right now, and I keep hearing this buzzword. What does the code below do?
i18n Strings (Unicode)
> ustring = u'A unicode \u018e string \xf1'
> ustring
u'A unicode \u018e string \xf1'

## (ustring from above contains a unicode string)
> s = ustring.encode('utf-8')
> s
'A unicode \xc6\x8e string \xc3\xb1'  ## bytes of utf-8 encoding
> t = unicode(s, 'utf-8')  ## Convert bytes back to a unicode string
> t == ustring  ## It's the same as the original, yay!
True
Files Unicode
import codecs

f = codecs.open('foo.txt', 'rU', 'utf-8')
for line in f:
    # here line is a *unicode* string
Unicode is a character standard that covers characters from almost all written languages. Every Unicode character is assigned a unique integer code point between 0 and 0x10FFFF. A Unicode string is a sequence of zero or more code points.
In short, a Unicode string is a sequence of code points, which are numbers from 0 through 0x10FFFF (1,114,111 decimal). This sequence of code points needs to be represented in memory as a sequence of code units, and the code units are then mapped to 8-bit bytes.
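As a rough illustration of the code point, code unit, and byte layers, here is a small Python 3 sketch. It assumes UTF-16 (little-endian, no byte-order mark) as the example encoding, and the sample string is made up for the demonstration.

```python
import struct

# Sample text (an assumption for this sketch): 'A', U+018E, and U+1F600.
# U+1F600 lies outside the Basic Multilingual Plane, so UTF-16 needs
# two code units (a surrogate pair) to represent it.
text = "A\u018e\U0001F600"

# Layer 1: the string as a sequence of code points.
code_points = [ord(ch) for ch in text]

# Layer 3: the bytes produced by the UTF-16 little-endian encoding.
utf16_bytes = text.encode("utf-16-le")

# Layer 2: group those bytes into 16-bit code units.
unit_count = len(utf16_bytes) // 2
code_units = list(struct.unpack("<%dH" % unit_count, utf16_bytes))

print([hex(cp) for cp in code_points])  # ['0x41', '0x18e', '0x1f600']
print([hex(cu) for cu in code_units])   # ['0x41', '0x18e', '0xd83d', '0xde00']
print(len(utf16_bytes))                 # 8
```

Three code points become four 16-bit code units, which in turn become eight bytes.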
Unicode has more than 100,000 characters, far more than the 256 values a single byte can hold. That means a Unicode character may take more than one byte, so you need to make the distinction between characters and bytes. In Python 2, standard strings are really byte strings, and a Python character is really a byte.
A code point is written as U+<hex-code>, ranging from U+0000 to U+10FFFF. An example code point looks like this: U+004F (the letter O). The code point itself does not depend on any encoding; what depends on the encoding is how it is represented as bytes. Unicode defines several character encodings, the most widely used being UTF-8, UTF-16 and UTF-32.
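To make this concrete, here is a small Python 3 sketch that takes the example code point U+004F and encodes it with each of the three encodings. The big-endian codec variants are picked here only so that no byte-order mark is added to the output.

```python
code_point = 0x004F        # U+004F
char = chr(code_point)     # the character for that code point: 'O'

# Encode the same character with three different encodings.
# The "-be" (big-endian) variants are an assumption of this sketch;
# they avoid the byte-order mark that plain "utf-16"/"utf-32" would add.
encoded = {name: char.encode(name)
           for name in ("utf-8", "utf-16-be", "utf-32-be")}

print("U+%04X" % code_point, repr(char))
for name, data in encoded.items():
    print(name, data.hex(), len(data))
# utf-8 4f 1
# utf-16-be 004f 2
# utf-32-be 0000004f 4
```

The code point is the same in all three cases; only the bytes differ.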
In Python 3, Unicode strings are the default. The type str is a collection of Unicode code points, and the type bytes is used for representing collections of 8-bit integers (often interpreted as ASCII characters).
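A quick way to see the difference between the two types in Python 3 is to look at what indexing returns and what happens when you try to mix them. This is just a sketch; the variable names are made up for the example.

```python
text = "\xf1"                  # str: a single code point, U+00F1 (ñ)
data = text.encode("utf-8")    # bytes: the UTF-8 encoding of that code point

first_item_of_text = text[0]   # indexing a str gives a str of length 1
first_item_of_data = data[0]   # indexing bytes gives an int (0-255)

# str and bytes cannot be concatenated; one side has to be
# explicitly encoded or decoded first.
try:
    text + data
    mixing_allowed = True
except TypeError:
    mixing_allowed = False

print(len(text), len(data))    # 1 2
print(first_item_of_data)      # 195
print(list(data))              # [195, 177]
print(mixing_allowed)          # False
```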
Here is the code from the question, updated for Python 3:
>>> my_str = 'A unicode \u018e string \xf1'  # no need for "u" prefix
# the escape sequence "\u" denotes a Unicode code point (in hex)
>>> my_str
'A unicode Ǝ string ñ'
# the Unicode code points U+018E and U+00F1 were displayed
# as their corresponding glyphs
>>> my_bytes = my_str.encode('utf-8')  # convert to a bytes object
>>> my_bytes
b'A unicode \xc6\x8e string \xc3\xb1'
# the "b" prefix means a bytes literal
# the escape sequence "\x" denotes a byte using its hex value
# the code points U+018E and U+00F1 were encoded as 2-byte sequences
>>> my_str2 = my_bytes.decode('utf-8')  # convert back to str
>>> my_str2 == my_str
True
Working with files:
>>> f = open('foo.txt', 'r')  # text mode (Unicode)
>>> # the platform's default encoding (e.g. UTF-8) is used to decode the file
>>> # to set a specific encoding, use open('foo.txt', 'r', encoding="...")
>>> for line in f:
...     # here line is a str object
...     pass
...
>>> f = open('foo.txt', 'rb')  # "b" means binary mode (bytes)
>>> for line in f:
...     # here line is a bytes object
...     pass
...
In Python 2, the str type was a collection of 8-bit characters (like Python 3's bytes type). The English alphabet can be represented using these 8-bit characters, but symbols such as Ω, и, ±, and ♠ cannot.
Unicode is a standard for working with a wide range of characters. Each symbol has a code point (a number), and these code points can be encoded (converted to a sequence of bytes) using a variety of encodings.
UTF-8 is one such encoding. Code points up to U+007F (the ASCII range) are encoded using a single byte, and higher code points are encoded as sequences of two to four bytes.
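To see how the encoded length grows with the code point, here is a short Python 3 sketch. The four sample characters are picked for illustration only.

```python
# One sample character per UTF-8 length class (chosen for this sketch):
# U+0041 'A', U+00F1 'ñ', U+20AC '€', U+1F600 (an emoji).
samples = ["A", "\xf1", "\u20ac", "\U0001F600"]

for ch in samples:
    encoded = ch.encode("utf-8")
    print("U+%04X" % ord(ch), len(encoded), encoded.hex())
# U+0041 1 41
# U+00F1 2 c3b1
# U+20AC 3 e282ac
# U+1F600 4 f09f9880

byte_lengths = [len(ch.encode("utf-8")) for ch in samples]
```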
To allow working with Unicode characters, Python 2 has a unicode type which is a collection of Unicode code points (like Python 3's str type). The line ustring = u'A unicode \u018e string \xf1' creates a Unicode string with 20 characters.
When the Python interpreter displays the value of ustring, it escapes two of the characters (Ǝ and ñ) because they are not in the standard printable range.
The line s = ustring.encode('utf-8') encodes the Unicode string using UTF-8. This converts each code point to the appropriate byte or sequence of bytes. The result is a collection of bytes, which is returned as a str. The size of s is 22 bytes, because two of the characters have high code points and are encoded as a sequence of two bytes rather than a single byte.
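The 20-versus-22 count is easy to check. The sketch below uses Python 3 syntax (so there is no u prefix, and encode() returns bytes instead of a Python 2 str), which is an assumption on my part since the question's code is Python 2.

```python
ustring = 'A unicode \u018e string \xf1'
s = ustring.encode('utf-8')

# Find the characters that need more than one byte in UTF-8.
multi_byte = {ch: len(ch.encode('utf-8'))
              for ch in ustring
              if len(ch.encode('utf-8')) > 1}

print(len(ustring))   # 20 (code points)
print(len(s))         # 22 (bytes)
print(multi_byte)     # {'Ǝ': 2, 'ñ': 2}
```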
When the Python interpreter displays the value of s, it escapes four bytes that are not in the printable range (\xc6, \x8e, \xc3, and \xb1). The two pairs of bytes are not treated as single characters like before because s is of type str, not unicode.
The line t = unicode(s, 'utf-8') does the opposite of encode(). It reconstructs the original code points by looking at the bytes of s and parsing byte sequences. The result is a Unicode string.
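Decoding only gives back the original text if you name the encoding that was actually used. Here is a Python 3 sketch of that, using bytes.decode() as the Python 3 counterpart of unicode(s, 'utf-8'); the latin-1 and ascii cases are included purely to show what a wrong guess looks like.

```python
s = b'A unicode \xc6\x8e string \xc3\xb1'   # UTF-8 bytes from the example

# Correct codec: the two-byte sequences are parsed back into
# the code points U+018E and U+00F1.
t = s.decode('utf-8')

# Wrong codec that happens to accept any byte: every byte becomes
# its own character, so the result has 22 characters instead of 20.
wrong = s.decode('latin-1')

# Wrong codec that rejects bytes above 0x7F.
try:
    s.decode('ascii')
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

print(t == 'A unicode \u018e string \xf1')  # True
print(len(t), len(wrong))                   # 20 22
print(ascii_failed)                         # True
```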
The call to codecs.open() specifies utf-8 as the encoding, which tells Python to interpret the contents of the file (a collection of bytes) as a Unicode string that has been encoded using UTF-8.
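For comparison, here is a self-contained Python 3 sketch of the same idea using the built-in open() with an explicit encoding. The temporary directory and the file contents are made up for the example; the file name foo.txt is taken from the question.

```python
import os
import tempfile

text = "A unicode \u018e string \xf1\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "foo.txt")

    # Write the text as UTF-8. newline="" turns off newline translation
    # so the bytes on disk are the same on every platform.
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write(text)

    # Text mode with an explicit encoding: each line is a str.
    with open(path, "r", encoding="utf-8") as f:
        lines = list(f)

    # Binary mode: the raw, still-encoded bytes.
    with open(path, "rb") as f:
        raw = f.read()

print(lines == [text])               # True
print(raw == text.encode("utf-8"))   # True
print(len(text), len(raw))           # 21 23
```

Passing the encoding explicitly avoids depending on the platform default.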