According to this test:
# -*- coding: utf-8 -*-
ENCODING = 'utf-8'
# what is the difference between decode and unicode?
test_cases = [
    'aaaaa',
    'ááááá',
    'ℕℤℚℝℂ',
]
FORMAT = '%-10s %5d %-10s %-10s %5d %-10s %10s'
for text in test_cases:
    decoded = text.decode(ENCODING)
    unicoded = unicode(text, ENCODING)
    equal = decoded == unicoded
    print FORMAT % (decoded, len(decoded), type(decoded), unicoded, len(unicoded), type(unicoded), equal)
There is no difference between .decode() and unicode():
aaaaa 5 <type 'unicode'> aaaaa 5 <type 'unicode'> True
ááááá 5 <type 'unicode'> ááááá 5 <type 'unicode'> True
ℕℤℚℝℂ 5 <type 'unicode'> ℕℤℚℝℂ 5 <type 'unicode'> True
Am I right? If so, why do we have two different ways of accomplishing the same thing? Which one should I use? Is there any subtle difference?
Comparing the documentation for the two functions, the differences between the two methods do indeed seem very minor. The unicode function is documented as:
If encoding and/or errors are given, unicode() will decode the object which can either be an 8-bit string or a character buffer using the codec for encoding. The encoding parameter is a string giving the name of an encoding; if the encoding is not known, LookupError is raised. Error handling is done according to errors; this specifies the treatment of characters which are invalid in the input encoding. If errors is 'strict' (the default), a ValueError is raised on errors, ...
whereas the description for str.decode states:
Decodes the string using the codec registered for encoding. encoding defaults to the default string encoding. errors may be given to set a different error handling scheme. The default is 'strict', meaning that encoding errors raise UnicodeError. ...
Thus, the only differences seem to be that unicode also works for character buffers and that the documented error for invalid input differs (ValueError vs. UnicodeError). In practice the latter is not a real difference, since UnicodeError is a subclass of ValueError and both calls raise UnicodeDecodeError on undecodable input. Another, minor difference is in backwards compatibility: unicode is documented as being "New in version 2.0" whereas str.decode is "New in version 2.2".
Given the above, which method to use seems to be entirely a matter of taste.
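The point about the exception types can be checked directly. This is only a minimal sketch: the `text_type` alias and the `error_from` helper are names made up for illustration, and the fallback to `str` is an assumption so the snippet also runs on Python 3, where `str(bytes, encoding)` plays the role of `unicode(str, encoding)`.

```python
# -*- coding: utf-8 -*-
# Hypothetical alias: use unicode on Python 2, fall back to str on Python 3.
try:
    text_type = unicode
except NameError:
    text_type = str

invalid = b'\xff\xfe\xfa'  # these bytes are not valid UTF-8


def error_from(func):
    """Call func and return the type of the ValueError it raises, if any."""
    try:
        func()
    except ValueError as exc:
        return type(exc)
    return None


decode_error = error_from(lambda: invalid.decode('utf-8'))
unicode_error = error_from(lambda: text_type(invalid, 'utf-8'))

print('decode raised:  %s' % decode_error.__name__)
print('unicode raised: %s' % unicode_error.__name__)
print('UnicodeError is a ValueError: %s' % issubclass(UnicodeError, ValueError))
```

Both calls are caught by the same `except ValueError` clause, so code written against either documented exception type handles both.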
decode:
Decodes the string using the codec registered for encoding. encoding defaults to the default string encoding.
http://docs.python.org/2/library/stdtypes.html?#str.decode
unicode:
Return the Unicode string version of object [...].
See: http://docs.python.org/2/library/functions.html#unicode
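Since you pass the same encoding to both calls, they return the same result, and this holds for any encoding, not just UTF-8. They differ mainly in what they accept: unicode() also works on objects that are not byte strings (without an encoding argument it behaves like str() but returns a Unicode string), whereas decode is a method of string objects only.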
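A quick way to see how the two behave with other encodings and other argument types is sketched below. The `text_type` alias is a hypothetical name, and its fallback to `str` is an assumption so the snippet also runs on Python 3; the two encodings are arbitrary picks for illustration.

```python
# -*- coding: utf-8 -*-
# Hypothetical alias: use unicode on Python 2, fall back to str on Python 3.
try:
    text_type = unicode
except NameError:
    text_type = str

# Encode a sample text to bytes with a non-UTF-8 encoding.
raw = u'ááááá'.encode('latin-1')

# With the same encoding passed to both, the results agree.
encodings = ('latin-1', 'cp1252')
same_results = [raw.decode(enc) == text_type(raw, enc) for enc in encodings]

# unicode() also converts objects that are not strings; an int has no decode().
number_as_text = text_type(42)
has_decode = hasattr(42, 'decode')

print('same result per encoding: %s' % same_results)
print('int converted to text: %r' % number_as_text)
print('int has a decode method: %s' % has_decode)
```

So the choice between the two matters mostly when the input might not be a byte string.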