Python: what's the difference between str(u'a') and u'a'.encode('utf-8')?

Tags: python, unicode

As the title says: is there a reason not to use str() to convert a unicode string to str?

>>> str(u'a')
'a'
>>> str(u'a').__class__
<type 'str'>
>>> u'a'.encode('utf-8')
'a'
>>> u'a'.encode('utf-8').__class__
<type 'str'>
>>> u'a'.encode().__class__
<type 'str'>

UPDATE: Thanks for the answer. I also didn't know that if I create a str literal containing a special character, it is automatically stored as UTF-8 bytes (at least in my terminal):

>>> a = '€'
>>> a.__class__
<type 'str'>
>>> a
'\xe2\x82\xac'

Also, the same literal is a Unicode string (str) in Python 3.

asked Aug 27 '12 21:08 by James Lin



1 Answer

When you write str(u'a') in Python 2, it converts the Unicode string to a bytestring using the default encoding, which (unless you've gone to the trouble of changing it) is ASCII.

The second version explicitly encodes the string as UTF-8.
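
For ASCII-only input the two approaches produce the same bytes, which is why the outputs in the question look identical. A minimal sketch of that point, written with explicit encode() calls so it should behave the same on Python 2 and Python 3 (the variable names are only illustrative, and the ASCII default is assumed unchanged):

```python
# -*- coding: utf-8 -*-
text = u'a'

# Roughly what Python 2's str(u'a') does implicitly,
# assuming the default encoding is still ASCII.
ascii_bytes = text.encode('ascii')

# The explicit version from the question.
utf8_bytes = text.encode('utf-8')

# UTF-8 is a superset of ASCII, so pure-ASCII text encodes identically.
print(ascii_bytes == utf8_bytes)
```

This prints True: the choice of encoding only starts to matter once the text contains characters outside the ASCII range.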

The difference is more apparent if you try with a string containing non-ASCII characters. The second version will still work:

>>> u'€'.encode('utf-8')
'\xe2\x82\xac'

The first version will give an exception:

>>> str(u'€')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u20ac' in position 0: ordinal not in range(128)
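
The same contrast can be reproduced in a script without relying on the terminal's encoding. The sketch below is only an illustration: it writes the euro sign as an escape, uses an explicit encode('ascii') to stand in for what Python 2's str() attempts under the default settings, and the names euro, utf8_bytes and ascii_error are made up for the example.

```python
# -*- coding: utf-8 -*-
euro = u'\u20ac'  # EURO SIGN, written as an escape so the source stays ASCII

# Explicit UTF-8 encoding works for any Unicode character.
utf8_bytes = euro.encode('utf-8')
print(len(utf8_bytes))  # the euro sign takes three bytes in UTF-8

# Encoding to ASCII fails, just as str(euro) does on Python 2
# when the default encoding has not been changed.
ascii_error = None
try:
    euro.encode('ascii')
except UnicodeEncodeError as exc:
    ascii_error = exc
    print('ascii failed: %s' % exc.reason)

# Decoding with the same codec recovers the original text.
print(utf8_bytes.decode('utf-8') == euro)
```

Calling encode() with an explicit codec makes the conversion predictable, whereas str() on Python 2 depends on a process-wide default.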
answered Oct 06 '22 23:10 by Mark Byers