myString = 'éíěřáé'
I need to decode this string to unicode. Is there any difference between the following usages, and between these two methods in general?
myString.decode(encoding='UTF-8', errors='ignore')
and
unicode(myString, encoding='UTF-8', errors='ignore')
You can think of unicode as a general representation of some text, which can be encoded in many different ways into a sequence of binary data represented via str. Note: In Python 3, unicode was renamed to str and there is a new bytes type for a plain sequence of bytes.
Unicode defines far more characters than the 256 values a single byte can hold. That means a Unicode character may need more than one byte once it is encoded, so you need to make the distinction between characters and bytes. Standard Python 2 strings (str) are really byte strings, and a character in a Python 2 str is really a byte.
In Python 3, a character in a str represents one Unicode character. However, to represent more than 256 characters, Unicode encodings use more than one byte for many characters. bytes objects give you access to the underlying bytes.
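To make the character/byte distinction concrete, here is a minimal Python 3 sketch (the question itself is about Python 2, so this is the Python 3 analogue, not the asker's code). The variable names are made up for illustration, and the text is written with escapes so that it is certain to be the precomposed form of the string from the question.

```python
# Python 3 sketch: count characters versus bytes for the same text.
# The escapes spell out the question's string in precomposed form.
text = '\u00e9\u00ed\u011b\u0159\u00e1\u00e9'

# The same six characters, encoded two different ways.
utf8_bytes = text.encode('utf-8')         # two bytes per character here
latin2_bytes = text.encode('iso-8859-2')  # one byte per character

char_count = len(text)
utf8_byte_count = len(utf8_bytes)
latin2_byte_count = len(latin2_bytes)

print(char_count, utf8_byte_count, latin2_byte_count)  # 6 12 6
```

The character count stays the same while the byte count depends on the encoding, which is why the encoding has to be named whenever bytes are turned back into text.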
decode() is a method of str (byte strings) in Python 2. It converts a byte string from the encoding it is stored in into a unicode object, which is the opposite of what encode() does. It takes the name of the encoding the bytes are in and returns the decoded text.
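As a rough illustration of decode() and its errors argument, the following Python 3 sketch decodes a byte string that ends in a byte that is not valid UTF-8. The input bytes and variable names are assumptions chosen for the example; in Python 2 the same calls would be made on a str instead of a bytes object.

```python
# Python 3 sketch: decode() with different error handlers.
# The first four bytes are valid UTF-8; the trailing 0xff is not.
raw = b'\xc3\xa9\xc3\xad\xff'

# errors='ignore' silently drops the bytes that cannot be decoded.
ignored = raw.decode('utf-8', errors='ignore')

# errors='replace' substitutes U+FFFD for the undecodable byte.
replaced = raw.decode('utf-8', errors='replace')

# The default, errors='strict', raises instead.
try:
    raw.decode('utf-8')
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# encode() goes the other way: text back to bytes.
round_trip = ignored.encode('utf-8')
```

Note that errors='ignore' loses data without any warning, so it is usually worth confirming that dropping bytes is acceptable before relying on it.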
The unicode constructor can take other types apart from strings:
>>> unicode(10)
u'10'
For the bytestring case, however, the two forms are mostly equivalent. Some encoding options are not valid for the unicode constructor as they do not result in unicode output, but are valid for the .decode method of bytestrings, such as 'hex':
>>> unicode('10', encoding='hex')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: decoder did not return an unicode object (type=str)
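The error comes from the fact that 'hex' is a bytes-to-bytes codec rather than a text encoding. A small Python 3 sketch of the same idea, assuming the codec is reached through codecs.decode (which is how Python 3 exposes such codecs), shows that the result is bytes, not text:

```python
import codecs

# Python 3 sketch: the 'hex' codec maps bytes to bytes, not bytes to text.
hex_input = b'10'  # the same two hex digits as in the transcript above

decoded = codecs.decode(hex_input, 'hex')

# The result is a single byte with value 0x10, still a bytes object.
result_type = type(decoded).__name__
print(result_type, decoded)
```

Since the decoder never produces a text object, a constructor whose job is to return text has nothing valid to return.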
They're essentially the same, but with some minor performance shortcuts in either case: str.decode knows that its argument is a string, so it can shortcut type checking of its argument, while unicode.__new__ has shortcuts for some common encodings including UTF-8.
Both methods call into PyCodec_Decode in the general case.
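The equivalence can be checked directly. This is a hedged sketch of the Python 3 counterparts, where bytes.decode plays the role of Python 2's str.decode and the str constructor plays the role of unicode; it says nothing about the internal shortcuts, only about the results. The variable names are invented for the example.

```python
import codecs

# Python 3 sketch: three ways to decode the same UTF-8 bytes.
raw = '\u00e9\u00ed\u011b\u0159\u00e1\u00e9'.encode('utf-8')

via_method = raw.decode(encoding='utf-8', errors='ignore')
via_constructor = str(raw, encoding='utf-8', errors='ignore')
via_codecs = codecs.decode(raw, 'utf-8', 'ignore')

# All three produce the same text for a bytes input.
all_equal = via_method == via_constructor == via_codecs

# Only the constructor also accepts non-bytes objects such as numbers;
# an int has no decode() method at all.
number_text = str(10)
int_has_decode = hasattr(10, 'decode')
```

For a byte string input the choice is therefore mostly a matter of style; the constructor form is the one to reach for when the input may not be a byte string.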