Working with Python 2.7, I'm wondering what real advantage there is in using the type unicode instead of str, as both of them seem to be able to hold Unicode strings. Is there any special reason apart from being able to set Unicode code points in unicode strings using the escape character \?
Executing a module with:
# -*- coding: utf-8 -*-
a = 'á'
ua = u'á'
print a, ua
Results in: á, á
EDIT:
More testing using Python shell:
>>> a = 'á'
>>> a
'\xc3\xa1'
>>> ua = u'á'
>>> ua
u'\xe1'
>>> ua.encode('utf8')
'\xc3\xa1'
>>> ua.encode('latin1')
'\xe1'
>>> ua
u'\xe1'
So, the unicode string seems to be encoded using latin1 instead of utf-8, and the raw string is encoded using utf-8? I'm even more confused now! :S
Unicode, on the other hand, has tens of thousands of characters (and room for over a million code points). That means a Unicode character can take more than one byte, so you need to make the distinction between characters and bytes. Standard Python 2 strings are really byte strings, and a Python 2 character is really a byte.
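To make that concrete, here is a small sketch in a Python 2.7 shell. It shows that iterating over a str gives you one byte at a time, while the corresponding unicode string is a single character:

>>> s = u'\xe1'.encode('utf-8')    # the bytes for á in UTF-8
>>> len(s)                         # two bytes
2
>>> [hex(ord(b)) for b in s]       # each "character" of a str is really one byte
['0xc3', '0xa1']
>>> len(u'\xe1')                   # the unicode string is a single character
1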
Unicode is a standard system for representing characters from almost all written languages. Every Unicode character is assigned a unique integer code point between 0 and 0x10FFFF. A Unicode string is a sequence of zero or more code points.
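In Python 2 you can move between characters and their code points with ord() and unichr(), for example:

>>> ord(u'\xe1')        # code point of á (U+00E1)
225
>>> hex(ord(u'\xe1'))
'0xe1'
>>> unichr(0x20ac)      # build a character from a code point (the euro sign)
u'\u20ac'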
In Python 3, bytes and str instances can't be used together with operators (like > or +). In Python 2, str contains sequences of 8-bit values and unicode contains sequences of Unicode characters; str and unicode can be used together with operators as long as the str contains only 7-bit ASCII characters.
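A minimal sketch of that Python 2 rule (the exact traceback text may vary slightly between minor versions): mixing works while the str side is pure ASCII, because Python implicitly decodes it, and fails otherwise.

>>> u'abc' + 'def'        # the str side is pure ASCII: implicitly decoded
u'abcdef'
>>> u'abc' + '\xc3\xa1'   # the str side holds non-ASCII bytes: implicit decode fails
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)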
Python 3's str type (the unicode type in Python 2) uses the Unicode Standard for representing characters, which lets Python programs work with all of these different possible characters.
Python 2 has two string types: one is unicode and the other is str. The unicode type is meant for working with the code points of characters; the str type is meant for working with the encoded binary representation of characters. A unicode object needs to be converted to a str object before Python can write the character to a file, and likewise before the character can be printed.
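For example (a sketch assuming Python 2.7 and a hypothetical file name out.txt), writing a non-ASCII unicode object to a plain file object fails, while encoding it first, or letting codecs.open do the encoding, works:

# -*- coding: utf-8 -*-
import codecs

ua = u'\xe1'   # á

# This would raise UnicodeEncodeError: the file expects bytes (str),
# so Python tries to encode ua with the default 'ascii' codec:
#   open('out.txt', 'w').write(ua)

# Explicitly encode to the bytes you want:
with open('out.txt', 'w') as f:
    f.write(ua.encode('utf-8'))

# Or let codecs.open handle the encoding for you:
with codecs.open('out.txt', 'w', encoding='utf-8') as f:
    f.write(ua)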
When __unicode__() is omitted and someone calls unicode(o) or u"%s" % o, Python calls o.__str__() and converts the result to unicode using the system encoding (see the documentation of __unicode__()). The opposite is not true: if you implement __unicode__() but not __str__(), then when someone calls str(o) or "%s" % o, Python returns repr(o).
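A small sketch of that protocol, using a hypothetical class Label and Python 2.7:

# -*- coding: utf-8 -*-

class Label(object):                           # hypothetical example class
    def __unicode__(self):
        return u'\xe1'                         # the text, as a unicode object

    def __str__(self):
        return unicode(self).encode('utf-8')   # the bytes: encode the text

o = Label()
print repr(unicode(o))   # u'\xe1'     -> __unicode__ was called
print repr(str(o))       # '\xc3\xa1'  -> __str__ was called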
unicode is meant to handle text. Text is a sequence of code points, which may be bigger than a single byte. Text can be encoded in a specific encoding to represent it as raw bytes (e.g. utf-8, latin-1, ...).
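The round trip between the two types is explicit: encode() goes from unicode to str, and decode() goes back. A short sketch in the Python 2.7 shell:

>>> ua = u'\xe1'                  # the code point U+00E1 (á)
>>> ua.encode('utf-8')            # unicode -> bytes
'\xc3\xa1'
>>> '\xc3\xa1'.decode('utf-8')    # bytes -> unicode
u'\xe1'
>>> '\xe1'.decode('latin-1')      # the same text, obtained from different bytes
u'\xe1'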
Note that unicode is not encoded! The internal representation used by Python is an implementation detail, and you shouldn't care about it as long as it is able to represent the code points you want.
On the contrary, str in Python 2 is a plain sequence of bytes. It does not represent text!
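To see that a str carries no text semantics, note that the same bytes mean different characters depending on which encoding you assume when decoding them (sketch, Python 2.7):

>>> raw = '\xc3\xa1'              # just two bytes, no encoding attached
>>> raw.decode('utf-8')           # read as UTF-8: one character (á)
u'\xe1'
>>> raw.decode('latin-1')         # read as Latin-1: two different characters (Ã¡)
u'\xc3\xa1'
>>> len(raw)                      # length in bytes, not in characters
2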
You can think of unicode as a general representation of some text, which can be encoded in many different ways into a sequence of binary data represented via str.
Note: in Python 3, unicode was renamed to str and there is a new bytes type for a plain sequence of bytes.
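For instance, one and the same unicode value can be serialized into several different str values, one per encoding (sketch, Python 2.7 shell):

>>> ua = u'\xe1'
>>> ua.encode('utf-8')
'\xc3\xa1'
>>> ua.encode('latin-1')
'\xe1'
>>> ua.encode('utf-16-le')
'\xe1\x00'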
Some differences that you can see:
>>> len(u'à')                     # a single code point
1
>>> len('à')                      # by default utf-8 -> takes two bytes
2
>>> len(u'à'.encode('utf-8'))
2
>>> len(u'à'.encode('latin1'))    # in latin1 it takes one byte
1
>>> print u'à'.encode('utf-8')    # terminal encoding is utf-8
à
>>> print u'à'.encode('latin1')   # it cannot understand the latin1 byte
�
Note that using str you have lower-level control over the individual bytes of a specific encoded representation, while using unicode you can only operate at the code-point level. For example, you can do:
>>> 'àèìòù'
'\xc3\xa0\xc3\xa8\xc3\xac\xc3\xb2\xc3\xb9'
>>> print 'àèìòù'.replace('\xa8', '')
à�ìòù
What was valid UTF-8 before isn't valid UTF-8 anymore. Using a unicode string you cannot operate in such a way that the resulting string is not valid Unicode text: you can remove a code point, replace a code point with a different code point, and so on, but you cannot mess with the internal representation.
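By contrast, the same kind of replacement done at the code-point level keeps the text valid (a sketch assuming a UTF-8 terminal, as in the examples above):

>>> print u'\xe0\xe8\xec\xf2\xf9'.replace(u'\xe8', u'')   # drop the code point for è
àìòù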