What is a unicode string? [closed]

Tags: python, unicode, utf-8

What exactly is a unicode string?

What's the difference between a regular string and a unicode string?

What is utf-8?

I'm trying to learn Python right now, and I keep hearing this buzzword. What does the code below do?

i18n Strings (Unicode)

    > ustring = u'A unicode \u018e string \xf1'
    > ustring
    u'A unicode \u018e string \xf1'

    ## (ustring from above contains a unicode string)
    > s = ustring.encode('utf-8')
    > s
    'A unicode \xc6\x8e string \xc3\xb1'  ## bytes of utf-8 encoding
    > t = unicode(s, 'utf-8')             ## Convert bytes back to a unicode string
    > t == ustring                      ## It's the same as the original, yay!
    True

Files Unicode

    import codecs

    f = codecs.open('foo.txt', 'rU', 'utf-8')
    for line in f:
      # here line is a *unicode* string

279

asked Feb 16 '14 07:02

Stevanus Iskandar

1 Answer

Update: Python 3

In Python 3, Unicode strings are the default. The type str is an immutable sequence of Unicode code points, and the type bytes represents immutable sequences of 8-bit integers (often interpreted as ASCII characters).
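A minimal sketch of that distinction, assuming Python 3 (the variable names are illustrative only): indexing a str gives back a one-character str, while indexing a bytes object gives back an integer between 0 and 255.

```python
text = '\xf1'                    # str holding one code point, U+00F1 (ñ)
data = text.encode('utf-8')      # bytes holding its UTF-8 encoding

print(type(text).__name__, len(text))   # str 1
print(type(data).__name__, len(data))   # bytes 2
print(type(text[0]).__name__)           # str: indexing a str gives a 1-character str
print(data[0], data[1])                 # 195 177, i.e. 0xc3 and 0xb1
```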

Here is the code from the question, updated for Python 3:

    >>> my_str = 'A unicode \u018e string \xf1' # no need for "u" prefix
    # the escape sequence "\u" denotes a Unicode code point (in hex)
    >>> my_str
    'A unicode Ǝ string ñ'
    # the Unicode code points U+018E and U+00F1 were displayed
    # as their corresponding glyphs
    >>> my_bytes = my_str.encode('utf-8') # convert to a bytes object
    >>> my_bytes
    b'A unicode \xc6\x8e string \xc3\xb1'
    # the "b" prefix means a bytes literal
    # the escape sequence "\x" denotes a byte using its hex value
    # the code points U+018E and U+00F1 were encoded as 2-byte sequences
    >>> my_str2 = my_bytes.decode('utf-8') # convert back to str
    >>> my_str2 == my_str
    True

Working with files:

    >>> f = open('foo.txt', 'r') # text mode (Unicode)
    >>> # the platform's default encoding (e.g. UTF-8) is used to decode the file
    >>> # to set a specific encoding, use open('foo.txt', 'r', encoding="...")
    >>> for line in f:
    ...     pass  # here line is a str object

    >>> f = open('foo.txt', 'rb') # "b" means binary mode (bytes)
    >>> for line in f:
    ...     pass  # here line is a bytes object
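As a rough, self-contained illustration of the two modes (a sketch only: the temporary directory, the file contents, and the names text_lines and byte_lines are assumptions made for the example), the same file can be written with an explicit encoding and then read back both ways:

```python
import os
import tempfile

# Hypothetical file inside a temporary directory, used only for this demo.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'foo.txt')

    # Write text with an explicit encoding instead of the platform default.
    # newline='\n' keeps the bytes on disk identical on every platform.
    with open(path, 'w', encoding='utf-8', newline='\n') as f:
        f.write('A unicode \u018e string \xf1\n')

    # Text mode: each line is decoded to a str.
    with open(path, 'r', encoding='utf-8') as f:
        text_lines = list(f)

    # Binary mode: each line is the raw bytes stored on disk.
    with open(path, 'rb') as f:
        byte_lines = list(f)

print(len(text_lines[0]))   # 21 code points (20 characters plus the newline)
print(len(byte_lines[0]))   # 23 bytes (two characters take two bytes each)
```

Passing encoding explicitly avoids depending on whatever the platform default happens to be.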

Historical answer: Python 2

In Python 2, the str type was a collection of 8-bit characters (like Python 3's bytes type). The English alphabet can be represented using these 8-bit characters, but symbols such as Ω, и, ±, and ♠ cannot.

Unicode is a standard for working with a wide range of characters. Each symbol has a code point (a number), and these code points can be encoded (converted to a sequence of bytes) using a variety of encodings.
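To make "each symbol has a code point" concrete, here is a small sketch in Python 3 (not Python 2) using the built-ins ord() and chr() plus the standard unicodedata module. The symbols are written as escapes so the code points are unambiguous; the variable names are illustrative.

```python
import unicodedata

# A, Ω, и, ±, ♠ written as explicit escapes
symbols = ['A', '\u03a9', '\u0438', '\u00b1', '\u2660']

# ord() maps a one-character str to its code point (an int)
code_points = [ord(s) for s in symbols]

for s, cp in zip(symbols, code_points):
    # chr() is the inverse of ord()
    assert chr(cp) == s
    print('U+%04X' % cp, unicodedata.name(s))
```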

UTF-8 is one such encoding. The low code points are encoded using a single byte, and higher code points are encoded as sequences of bytes.
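The variable-length behaviour can be checked directly. This is a Python 3 sketch; the sample characters (including the emoji U+1F600, which does not appear in the question) are chosen here only to cover the 1-, 2-, 3-, and 4-byte cases.

```python
samples = ['A', '\u00f1', '\u018e', '\u2660', '\U0001f600']

lengths = []
for ch in samples:
    encoded = ch.encode('utf-8')      # bytes for this single code point
    lengths.append(len(encoded))
    print('U+%04X -> %d byte(s): %r' % (ord(ch), len(encoded), encoded))

# U+0041 -> 1 byte(s): b'A'
# U+00F1 -> 2 byte(s): b'\xc3\xb1'
# U+018E -> 2 byte(s): b'\xc6\x8e'
# U+2660 -> 3 byte(s): b'\xe2\x99\xa0'
# U+1F600 -> 4 byte(s): b'\xf0\x9f\x98\x80'
```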

To allow working with Unicode characters, Python 2 has a unicode type which is a collection of Unicode code points (like Python 3's str type). The line ustring = u'A unicode \u018e string \xf1' creates a Unicode string with 20 characters.

When the Python interpreter displays the value of ustring, it escapes two of the characters (Ǝ and ñ) because they are not in the standard printable range.

The line s = ustring.encode('utf-8') encodes the Unicode string using UTF-8. This converts each code point to the appropriate byte or sequence of bytes. The result is a collection of bytes, which is returned as a str. The size of s is 22 bytes, because two of the characters have high code points and are encoded as a sequence of two bytes rather than a single byte.
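The 20 versus 22 count can be reproduced in Python 3, where str plays the role of Python 2's unicode and bytes plays the role of Python 2's str. This is a sketch of the same arithmetic, not the original Python 2 session; multi_byte is a name introduced for the example.

```python
ustring = 'A unicode \u018e string \xf1'
s = ustring.encode('utf-8')

print(len(ustring))   # 20 code points
print(len(s))         # 22 bytes

# The two characters that need more than one byte account for the difference.
multi_byte = [ch for ch in ustring if len(ch.encode('utf-8')) > 1]
print(len(multi_byte))  # 2
```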

When the Python interpreter displays the value of s, it escapes four bytes that are not in the printable range (\xc6, \x8e, \xc3, and \xb1). The two pairs of bytes are not treated as single characters like before because s is of type str, not unicode.

The line t = unicode(s, 'utf-8') does the opposite of encode(). It reconstructs the original code points by looking at the bytes of s and parsing byte sequences. The result is a Unicode string.
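Decoding only reconstructs the original code points when the codec matches the one used for encoding. The Python 3 sketch below (bytes.decode() stands in for Python 2's unicode(s, 'utf-8'); the latin-1 and ascii cases are added here purely for contrast) shows what happens otherwise:

```python
s = b'A unicode \xc6\x8e string \xc3\xb1'

# Correct codec: byte sequences are parsed back into the original code points.
t = s.decode('utf-8')
print(t == 'A unicode \u018e string \xf1')   # True

# Wrong codec that accepts any byte: each byte becomes its own code point.
wrong = s.decode('latin-1')
print(len(t), len(wrong))                    # 20 22

# Wrong codec that rejects bytes above 0x7f: decoding fails outright.
try:
    s.decode('ascii')
    failed_at = None
except UnicodeDecodeError as exc:
    failed_at = exc.start
print(failed_at)                             # 10, the index of the byte 0xc6
```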

The call to codecs.open() specifies utf-8 as the encoding, which tells Python to interpret the contents of the file (a collection of bytes) as a Unicode string that has been encoded using UTF-8.

178

answered Oct 18 '22 23:10

tom
