Difference between decode and unicode?

Question

According to this test:

# -*- coding: utf-8 -*-

ENCODING = 'utf-8'

# what is the difference between decode and unicode?
test_cases = [
    'aaaaa',
    'ááááá',
    'ℕℤℚℝℂ',
]
FORMAT = '%-10s %5d %-10s %-10s %5d %-10s %10s'
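# In Python 2, each literal above is a byte string (str) holding UTF-8 bytes.
for text in test_cases:
    decoded = text.decode(ENCODING)       # str method
    unicoded = unicode(text, ENCODING)    # built-in constructor
    equal = decoded == unicoded
    print FORMAT % (decoded, len(decoded), type(decoded), unicoded, len(unicoded), type(unicoded), equal)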

There is no difference between .decode() and unicode():

aaaaa          5 <type 'unicode'> aaaaa          5 <type 'unicode'>       True
ááááá          5 <type 'unicode'> ááááá          5 <type 'unicode'>       True
ℕℤℚℝℂ          5 <type 'unicode'> ℕℤℚℝℂ          5 <type 'unicode'>       True

Am I right? If so, why do we have two different ways of accomplishing the same thing? Which one should I use? Is there any subtle difference?

jochen · Accepted Answer

Comparing the documentation for the two functions, the differences do indeed seem very minor. The unicode function is documented as

If encoding and/or errors are given, unicode() will decode the object which can either be an 8-bit string or a character buffer using the codec for encoding. The encoding parameter is a string giving the name of an encoding; if the encoding is not known, LookupError is raised. Error handling is done according to errors; this specifies the treatment of characters which are invalid in the input encoding. If errors is 'strict' (the default), a ValueError is raised on errors, ...

whereas the description of str.decode states

Decodes the string using the codec registered for encoding. encoding defaults to the default string encoding. errors may be given to set a different error handling scheme. The default is 'strict', meaning that encoding errors raise UnicodeError. ...

Thus, the only differences seem to be that unicode also works for character buffers and that the documented exception for invalid input differs (ValueError vs. UnicodeError). Since UnicodeError is a subclass of ValueError, that second difference is one of wording rather than behaviour. Another minor difference is in backwards compatibility: unicode is documented as being "New in version 2.0" whereas str.decode is "New in version 2.2".
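The point about the exception types can be checked directly. The following is only a rough sketch: it feeds the same invalid UTF-8 bytes to both calls and records what each one raises. The names text_type, INVALID_UTF8 and error_from are made up for this illustration, and the text_type alias is an assumption so that the snippet also runs under Python 3, where str takes the place of unicode.

```python
import sys

# Assumption: under Python 3 the name `unicode` no longer exists, so `str`
# stands in for it; under Python 2 the real built-in is used.
text_type = str if sys.version_info[0] >= 3 else unicode

# Hypothetical sample input: 0xff can never start a valid UTF-8 sequence.
INVALID_UTF8 = b'\xff\xfe'


def error_from(call):
    """Run call() and return the exception it raises (None if it succeeds)."""
    try:
        call()
    except Exception as exc:
        return exc
    return None


method_error = error_from(lambda: INVALID_UTF8.decode('utf-8'))
constructor_error = error_from(lambda: text_type(INVALID_UTF8, 'utf-8'))

for label, err in (('decode', method_error), ('unicode', constructor_error)):
    # Report the concrete exception class and whether it matches both
    # of the exception types named in the documentation.
    print('%-8s %-20s UnicodeError=%s ValueError=%s' % (
        label,
        type(err).__name__,
        isinstance(err, UnicodeError),
        isinstance(err, ValueError),
    ))
```

If both rows show the same exception class, an except ValueError clause would catch the failure from either call, so the choice between the two should not affect error handling.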

Given the above, which method to use seems to be entirely a matter of taste.

Tim Zimmermann · Answer

decode:
Decodes the string using the codec registered for encoding. encoding defaults to the default string encoding.
http://docs.python.org/2/library/stdtypes.html?#str.decode

unicode:
Return the Unicode string version of object [...].
See: http://docs.python.org/2/library/functions.html#unicode

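Since you pass the same encoding to both calls, the functions return the same result. The same is true for any other encoding: results differ only when the two calls are given different encodings (for example, one explicit and the other left at the default string encoding).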
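Whether the choice of encoding matters is easy to try out. Here is a minimal sketch that round-trips one sample string through a few encodings and compares the two calls, each given the same encoding. The names text_type, SAMPLE, ENCODINGS and results are invented for this example, the list of encodings is an arbitrary pick, and the text_type alias is an assumption so the snippet also runs under Python 3 (where str replaces unicode).

```python
import sys

# Assumption: `str` stands in for `unicode` when run under Python 3.
text_type = str if sys.version_info[0] >= 3 else unicode

SAMPLE = u'\xe1\xe9\xed'  # three accented Latin letters
ENCODINGS = ['utf-8', 'latin-1', 'utf-16', 'cp1252']  # arbitrary selection

results = {}
for encoding in ENCODINGS:
    raw = SAMPLE.encode(encoding)          # byte string in this encoding
    decoded = raw.decode(encoding)         # method form
    unicoded = text_type(raw, encoding)    # constructor form
    results[encoding] = (decoded == unicoded)
    print('%-8s equal=%r' % (encoding, results[encoding]))
```

In this check the two forms should agree for every encoding in the list, which suggests the identical output in the question is not specific to UTF-8.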

Tags: python, unicode, python-2.7

blueFast
