
What is a unicode string? [closed]

What exactly is a unicode string?

What's the difference between a regular string and unicode string?

What is utf-8?

I'm trying to learn Python right now, and I keep hearing this buzzword. What does the code below do?

i18n Strings (Unicode)

> ustring = u'A unicode \u018e string \xf1'
> ustring
u'A unicode \u018e string \xf1'

## (ustring from above contains a unicode string)
> s = ustring.encode('utf-8')
> s
'A unicode \xc6\x8e string \xc3\xb1'  ## bytes of utf-8 encoding
> t = unicode(s, 'utf-8')             ## Convert bytes back to a unicode string
> t == ustring                        ## It's the same as the original, yay!
True

Files Unicode

import codecs

f = codecs.open('foo.txt', 'rU', 'utf-8')
for line in f:
    # here line is a *unicode* string
Stevanus Iskandar asked Feb 16 '14 07:02



1 Answer

Update: Python 3

In Python 3, Unicode strings are the default. The type str is a collection of Unicode code points, and the type bytes is used for representing collections of 8-bit integers (often interpreted as ASCII characters).

Here is the code from the question, updated for Python 3:

>>> my_str = 'A unicode \u018e string \xf1' # no need for "u" prefix
# the escape sequence "\u" denotes a Unicode code point (in hex)
>>> my_str
'A unicode Ǝ string ñ'
# the Unicode code points U+018E and U+00F1 were displayed
# as their corresponding glyphs
>>> my_bytes = my_str.encode('utf-8') # convert to a bytes object
>>> my_bytes
b'A unicode \xc6\x8e string \xc3\xb1'
# the "b" prefix means a bytes literal
# the escape sequence "\x" denotes a byte using its hex value
# the code points U+018E and U+00F1 were encoded as 2-byte sequences
>>> my_str2 = my_bytes.decode('utf-8') # convert back to str
>>> my_str2 == my_str
True
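To make the str/bytes split concrete, here is a minimal sketch (Python 3 assumed; the variable names are mine, not from the question). It shows that indexing a str gives a one-character str, indexing a bytes object gives an integer, and the two types do not mix implicitly.

```python
text = 'A unicode \u018e string \xf1'   # str: a sequence of code points
data = text.encode('utf-8')              # bytes: a sequence of integers 0-255

# Indexing a str yields a one-character str; indexing bytes yields an int.
first_char = text[0]
first_byte = data[0]
print(type(first_char).__name__, repr(first_char))
print(type(first_byte).__name__, first_byte)

# str and bytes are never combined implicitly; this raises TypeError.
try:
    text + data
    mixed_ok = True
except TypeError:
    mixed_ok = False
print(mixed_ok)
```

If you need to combine the two, decode the bytes (or encode the str) explicitly first.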

Working with files:

>>> f = open('foo.txt', 'r') # text mode (Unicode)
>>> # the platform's default encoding (e.g. UTF-8) is used to decode the file
>>> # to set a specific encoding, use open('foo.txt', 'r', encoding="...")
>>> for line in f:
...     pass # here line is a str object

>>> f = open('foo.txt', 'rb') # "b" means binary mode (bytes)
>>> for line in f:
...     pass # here line is a bytes object
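A self-contained version of the same idea, as a sketch: it writes a throwaway file first so it can actually run. The temporary directory and the explicit encoding="utf-8" and newline="\n" arguments are my choices for the illustration, not something the question requires.

```python
import os
import tempfile

text = 'A unicode \u018e string \xf1\n'

# Write a throwaway file so the sketch does not touch real data.
# newline='\n' disables newline translation so the bytes are predictable.
path = os.path.join(tempfile.mkdtemp(), 'foo.txt')
with open(path, 'w', encoding='utf-8', newline='\n') as f:
    f.write(text)

# Text mode: the file's bytes are decoded, so each line is a str.
with open(path, 'r', encoding='utf-8') as f:
    text_lines = list(f)

# Binary mode: nothing is decoded, so each line is a bytes object.
with open(path, 'rb') as f:
    byte_lines = list(f)

print(type(text_lines[0]).__name__, type(byte_lines[0]).__name__)
```

Passing the encoding explicitly avoids depending on the platform default, which is not UTF-8 everywhere.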

Historical answer: Python 2

In Python 2, the str type was a collection of 8-bit characters (like Python 3's bytes type). The English alphabet can be represented using these 8-bit characters, but symbols such as Ω, и, ±, and ♠ cannot.

Unicode is a standard for working with a wide range of characters. Each symbol has a code point (a number), and these code points can be encoded (converted to a sequence of bytes) using a variety of encodings.
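As a rough illustration in Python 3 (not the Python 2 of the question), ord() and chr() convert between a character and its code point, and the same code point comes out as different bytes depending on the encoding. The characters are taken from the question's string; the choice of encodings to compare is mine.

```python
# Each character maps to exactly one code point (an integer).
chars = 'A\u018e\xf1'   # 'A', U+018E and U+00F1 from the question's string
labels = [f'U+{ord(ch):04X}' for ch in chars]
print(labels)

# chr() is the inverse of ord().
round_trip = ''.join(chr(ord(ch)) for ch in chars)

# The same code point becomes different bytes under different encodings.
e = '\u018e'
encoded = {name: e.encode(name) for name in ('utf-8', 'utf-16-le', 'utf-32-le')}
for name, raw in encoded.items():
    print(name, raw, len(raw))
```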

UTF-8 is one such encoding. Code points up to U+007F (the ASCII range) are encoded using a single byte, and higher code points are encoded as sequences of two to four bytes.
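You can check the variable length of UTF-8 yourself. This is a small Python 3 sketch; the four sample characters are my own picks, chosen so that each needs a different number of bytes.

```python
# U+0041 (A), U+00F1 (n with tilde), U+20AC (euro sign), U+1F600 (an emoji)
samples = ['A', '\xf1', '\u20ac', '\U0001f600']

# Number of bytes each code point occupies once encoded as UTF-8.
lengths = [len(ch.encode('utf-8')) for ch in samples]

for ch, n in zip(samples, lengths):
    print(f'U+{ord(ch):04X} -> {n} byte(s)')
```

Because ASCII characters keep their single-byte values, any pure ASCII text is already valid UTF-8.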

To allow working with Unicode characters, Python 2 has a unicode type which is a collection of Unicode code points (like Python 3's str type). The line ustring = u'A unicode \u018e string \xf1' creates a Unicode string with 20 characters.

When the Python interpreter displays the value of ustring, it escapes two of the characters (Ǝ and ñ) because they are not in the standard printable range.

The line s = ustring.encode('utf-8') encodes the Unicode string using UTF-8. This converts each code point to the appropriate byte or sequence of bytes. The result is a collection of bytes, which is returned as a str. The size of s is 22 bytes, because two of the characters have high code points and are encoded as a sequence of two bytes rather than a single byte.
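The character and byte counts are easy to verify. The sketch below uses Python 3, where the encoded result is a bytes object rather than a Python 2 str; the names char_count, byte_count and multi_byte are just for illustration.

```python
ustring = 'A unicode \u018e string \xf1'
s = ustring.encode('utf-8')

char_count = len(ustring)   # number of code points
byte_count = len(s)         # number of bytes after encoding

# The characters that need more than one byte in UTF-8.
multi_byte = [ch for ch in ustring if len(ch.encode('utf-8')) > 1]

print(char_count, byte_count, len(multi_byte))
```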

When the Python interpreter displays the value of s, it escapes four bytes that are not in the printable range (\xc6, \x8e, \xc3, and \xb1). The two pairs of bytes are not treated as single characters like before because s is of type str, not unicode.
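Python 3 displays a str with its glyphs by default, but the built-in ascii() produces the same kind of escaped output that Python 2 showed for a unicode object, and repr() of a bytes object still escapes the non-printable bytes. A short sketch, assuming Python 3:

```python
ustring = 'A unicode \u018e string \xf1'
s = ustring.encode('utf-8')

# ascii() escapes every non-ASCII code point in a str.
escaped_text = ascii(ustring)

# repr() of bytes escapes each byte outside the printable ASCII range.
escaped_bytes = repr(s)

print(escaped_text)
print(escaped_bytes)

# The two bytes that encode U+018E are separate integers in the bytes object.
pair = list(s[10:12])
print(pair)
```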

The line t = unicode(s, 'utf-8') does the opposite of encode(). It reconstructs the original code points by looking at the bytes of s and parsing byte sequences. The result is a Unicode string.
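Decoding only works if you name the encoding the bytes were actually written in. The Python 3 sketch below (bytes.decode() plays the role of unicode(s, 'utf-8')) shows a correct decode, a decode with the wrong encoding, and a decode of bytes cut off in the middle of a two-byte sequence. The latin-1 choice and the slice position are mine, for illustration only.

```python
s = b'A unicode \xc6\x8e string \xc3\xb1'

# Correct: parse the bytes as UTF-8 to recover the original code points.
t = s.decode('utf-8')

# Wrong encoding: latin-1 maps every byte to one character, so no error
# is raised, but the result has 22 characters and the wrong symbols.
wrong = s.decode('latin-1')

# Truncated input: the slice ends after the first byte of a 2-byte sequence.
try:
    s[:11].decode('utf-8')
    truncated_ok = True
except UnicodeDecodeError:
    truncated_ok = False

print(len(t), len(wrong), truncated_ok)
```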

The call to codecs.open() specifies utf-8 as the encoding, which tells Python to interpret the contents of the file (a collection of bytes) as a Unicode string that has been encoded using UTF-8.

tom answered Oct 18 '22 23:10