As the title says, is there a reason not to use str() to convert a unicode string to str?
>>> str(u'a')
'a'
>>> str(u'a').__class__
<type 'str'>
>>> u'a'.encode('utf-8')
'a'
>>> u'a'.encode('utf-8').__class__
<type 'str'>
>>> u'a'.encode().__class__
<type 'str'>
UPDATE: Thanks for the answer. I also didn't know that if I create a string containing a special character, it ends up stored as UTF-8 bytes (at least in my terminal):
>>> a = '€'
>>> a.__class__
<type 'str'>
>>> a
'\xe2\x82\xac'
Also, str is a Unicode object in Python 3.
UTF-8 is a byte oriented encoding. The encoding specifies that each character is represented by a specific sequence of one or more bytes.
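To make the "one or more bytes" point concrete, here is a small sketch. It is written for Python 3 (an assumption, since the transcripts in this thread are Python 2), and the sample characters and the helper name utf8_length are chosen only for illustration.

```python
# Sample characters from different code point ranges (illustrative choice).
samples = [
    "a",           # U+0061, plain ASCII
    "\u00e9",      # U+00E9, e with acute accent
    "\u20ac",      # U+20AC, euro sign
    "\U0001F600",  # U+1F600, an emoji outside the Basic Multilingual Plane
]


def utf8_length(ch):
    """Return how many bytes the UTF-8 encoding of `ch` occupies."""
    return len(ch.encode("utf-8"))


# Map each character to the size of its UTF-8 byte sequence.
byte_counts = {ch: utf8_length(ch) for ch in samples}

for ch, count in byte_counts.items():
    print(repr(ch), count)
```

The euro sign comes out as three bytes, which matches the '\xe2\x82\xac' shown in the question's update.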
The encode() method encodes the string using the specified encoding. In Python 3, UTF-8 is used if no encoding is specified; in Python 2, the default encoding (normally ASCII) is used instead.
Unicode, on the other hand, has more than a hundred thousand characters. That means a single byte cannot represent every Unicode character, so you need to make the distinction between characters and bytes. Standard Python 2 strings are really byte strings, and a Python 2 character is really a byte.
Unicode is a standard encoding system that is used to represent characters from almost all languages. Every Unicode character is encoded using a unique integer code point between 0 and 0x10FFFF. A Unicode string is a sequence of zero or more code points.
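The code point idea can be poked at directly with ord() and chr(). This is a Python 3 sketch (an assumption; in Python 2 you would use unichr() and u'' literals), and the variable names are only illustrative.

```python
euro = "\u20ac"

# ord() gives the integer code point of a single character.
code_point = ord(euro)

# chr() goes the other way, from code point back to character.
round_trip = chr(code_point)

# A string is a sequence of code points, independent of any byte encoding.
text = "a\u20ac"
code_points = [ord(ch) for ch in text]

# 0x10FFFF is the highest valid code point; anything above it is rejected.
highest = chr(0x10FFFF)
try:
    chr(0x110000)
    out_of_range_rejected = False
except ValueError:
    out_of_range_rejected = True

print(hex(code_point), code_points, out_of_range_rejected)
```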
When you write str(u'a'), it converts the Unicode string to a bytestring using the default encoding which (unless you've gone to the trouble of changing it) will be ASCII.
The second version explicitly encodes the string as UTF-8.
The difference is more apparent if you try with a string containing non-ASCII characters. The second version will still work:
>>> u'€'.encode('utf-8')
'\xe2\x82\xac'
The first version will give an exception:
>>> str(u'€')
Traceback (most recent call last):
  File "", line 1, in
    str(u'€')
UnicodeEncodeError: 'ascii' codec can't encode character u'\u20ac' in position 0: ordinal not in range(128)
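If you are on Python 3, str() no longer encodes anything, but you can reproduce the same contrast by encoding explicitly. The sketch below assumes Python 3 and spells out what Python 2's str() does implicitly, namely an ASCII encode; the variable names are made up for the example.

```python
euro = "\u20ac"

# Explicit UTF-8 encoding works for any character.
utf8_bytes = euro.encode("utf-8")

# Encoding with ASCII is roughly what Python 2's str(u'...') does under the
# default settings, and it fails for anything outside the 0-127 range.
try:
    euro.encode("ascii")
    ascii_failed = False
    error_message = ""
except UnicodeEncodeError as exc:
    ascii_failed = True
    error_message = str(exc)

# For pure ASCII text both encodings give identical bytes, which is why
# str(u'a') and u'a'.encode('utf-8') look the same in the question.
same_for_ascii = "a".encode("ascii") == "a".encode("utf-8")

print(utf8_bytes, ascii_failed, same_for_ascii)
print(error_message)
```

So the practical advice carries over: name the encoding yourself instead of relying on a default.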