
Python str vs unicode types

Working with Python 2.7, I'm wondering what real advantage there is in using the type unicode instead of str, as both of them seem to be able to hold Unicode strings. Is there any special reason apart from being able to set Unicode codes in unicode strings using the escape char \?:

Executing a module with:

    # -*- coding: utf-8 -*-

    a = 'á'
    ua = u'á'
    print a, ua

Results in: á, á

EDIT:

More testing using Python shell:

    >>> a = 'á'
    >>> a
    '\xc3\xa1'
    >>> ua = u'á'
    >>> ua
    u'\xe1'
    >>> ua.encode('utf8')
    '\xc3\xa1'
    >>> ua.encode('latin1')
    '\xe1'
    >>> ua
    u'\xe1'

So, the unicode string seems to be encoded using latin1 instead of utf-8 and the raw string is encoded using utf-8? I'm even more confused now! :S

Caumons asked Aug 03 '13 15:08



1 Answer

unicode is meant to handle text. Text is a sequence of code points, each of which may need more than a single byte to store. Text can be encoded in a specific encoding (e.g. UTF-8, Latin-1) to represent it as raw bytes.

Note that unicode is not encoded! The internal representation used by Python is an implementation detail, and you shouldn't care about it as long as it is able to represent the code points you want.
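This is also why `u'\xe1'` in the question is not "Latin-1 encoded": the repr shows the code point U+00E1, and Latin-1 simply happens to map that code point to the byte `0xE1`. A minimal sketch in Python 3 syntax (where `str` plays the role of Python 2's `unicode`); the variable names are only illustrative:

```python
# Python 3: str holds code points, like Python 2's unicode type.
# The character is written as an escape so the source file encoding does not matter.
text = "\u00e1"  # 'á'

code_point = ord(text)                 # the abstract number Unicode assigns to 'á'
utf8_bytes = text.encode("utf-8")      # one possible byte representation
latin1_bytes = text.encode("latin-1")  # another one

print(hex(code_point))   # 0xe1
print(utf8_bytes)        # b'\xc3\xa1'
print(latin1_bytes)      # b'\xe1'
```

The code point and the Latin-1 byte share the value `0xE1` because Latin-1 covers exactly the first 256 code points; for characters outside that range the coincidence disappears.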

In contrast, str in Python 2 is a plain sequence of bytes. It does not represent text!

You can think of unicode as a general representation of some text, which can be encoded in many different ways into a sequence of binary data represented via str.

Note: In Python 3, unicode was renamed to str and there is a new bytes type for a plain sequence of bytes.
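As a rough illustration of that renaming, here is a short Python 3 sketch. The variable names are made up for the example; the point is only that text and bytes are separate types that do not mix implicitly:

```python
# Python 3 naming: str is text (code points), bytes is raw data.
text = "\u00e0"               # 'à' as text
data = text.encode("utf-8")   # the same character as UTF-8 bytes

print(type(text).__name__, len(text))   # str 1
print(type(data).__name__, len(data))   # bytes 2

# Python 3 refuses to concatenate text and bytes; there is no implicit conversion.
try:
    text + data
    mixed_ok = True
except TypeError:
    mixed_ok = False

print(mixed_ok)                        # False
print(data.decode("utf-8") == text)    # True
```

Decoding with the same encoding that produced the bytes gets the original text back; the conversion always has to be requested explicitly.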

Some differences that you can see:

    >>> len(u'à')  # a single code point
    1
    >>> len('à')   # by default utf-8 -> takes two bytes
    2
    >>> len(u'à'.encode('utf-8'))
    2
    >>> len(u'à'.encode('latin1'))  # in latin1 it takes one byte
    1
    >>> print u'à'.encode('utf-8')  # terminal encoding is utf-8
    à
    >>> print u'à'.encode('latin1') # it cannot understand the latin1 byte
    �
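The last line of that session can be approximated in Python 3 by decoding the Latin-1 byte as UTF-8 with `errors="replace"`. This is only a model of what a UTF-8 terminal does with bytes it cannot interpret, not a description of any particular terminal:

```python
# Python 3 sketch: what a UTF-8 consumer sees when handed a Latin-1 byte.
latin1_bytes = "\u00e0".encode("latin-1")   # 'à' as the single byte b'\xe0'

# b'\xe0' starts a multi-byte UTF-8 sequence that never completes, so a
# lenient decoder substitutes U+FFFD, the replacement character.
shown = latin1_bytes.decode("utf-8", errors="replace")

print(latin1_bytes)              # b'\xe0'
print(shown == "\ufffd")         # True
```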

Note that with str you have lower-level control over the individual bytes of a specific encoded representation, while with unicode you can only work at the code-point level. For example you can do:

    >>> 'àèìòù'
    '\xc3\xa0\xc3\xa8\xc3\xac\xc3\xb2\xc3\xb9'
    >>> print 'àèìòù'.replace('\xa8', '')
    à�ìòù

What was valid UTF-8 before no longer is. With a unicode string you cannot operate in such a way that the resulting string isn't valid Unicode text. You can remove a code point, replace a code point with a different code point, etc., but you cannot mess with the internal representation.
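The same contrast can be sketched in Python 3, where `bytes` stands in for Python 2's `str` and `str` for `unicode`. The names `broken` and `cleaned` are just for this example:

```python
# Python 3: bytes behaves like Python 2's str here.
text = "\u00e0\u00e8\u00ec\u00f2\u00f9"   # 'àèìòù'
data = text.encode("utf-8")

# Removing a single byte (the second byte of 'è') corrupts the UTF-8 sequence.
broken = data.replace(b"\xa8", b"")

try:
    broken.decode("utf-8")
    still_valid = True
except UnicodeDecodeError:
    still_valid = False

# Working at the code-point level removes the whole character instead.
cleaned = text.replace("\u00e8", "")

print(still_valid)     # False
print(len(cleaned))    # 4
```

The byte-level edit leaves a dangling lead byte `0xC3`, so the result can no longer be decoded, while the code-point edit still yields text that encodes cleanly.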

Bakuriu answered Oct 07 '22 11:10