Difference between decode and unicode?

According to this test:

# -*- coding: utf-8 -*-

ENCODING = 'utf-8'

# what is the difference between decode and unicode?
test_cases = [
    'aaaaa',
    'ááááá',
    'ℕℤℚℝℂ',
]
FORMAT = '%-10s %5d %-10s %-10s %5d %-10s %10s'
for text in test_cases:
    decoded = text.decode(ENCODING)
    unicoded = unicode(text, ENCODING)
    equal = decoded == unicoded
    print FORMAT % (decoded, len(decoded), type(decoded), unicoded, len(unicoded), type(unicoded), equal)

There is no difference between .decode() and unicode():

aaaaa          5 <type 'unicode'> aaaaa          5 <type 'unicode'>       True
ááááá          5 <type 'unicode'> ááááá          5 <type 'unicode'>       True
ℕℤℚℝℂ          5 <type 'unicode'> ℕℤℚℝℂ          5 <type 'unicode'>       True

Am I right? If so, why do we have two different ways of accomplishing the same thing? Which one should I use? Is there any subtle difference?

blueFast asked Dec 18 '13 10:12


2 Answers

Comparing the documentation for the two functions (here and here), the differences between the two methods do indeed seem very minor. The unicode function is documented as:

If encoding and/or errors are given, unicode() will decode the object which can either be an 8-bit string or a character buffer using the codec for encoding. The encoding parameter is a string giving the name of an encoding; if the encoding is not known, LookupError is raised. Error handling is done according to errors; this specifies the treatment of characters which are invalid in the input encoding. If errors is 'strict' (the default), a ValueError is raised on errors, ...

whereas the description for string.decode states:

Decodes the string using the codec registered for encoding. encoding defaults to the default string encoding. errors may be given to set a different error handling scheme. The default is 'strict', meaning that encoding errors raise UnicodeError. ...

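Thus, the only differences seem to be that unicode also works for character buffers and that the documented error for invalid input differs (ValueError vs. UnicodeError). In practice the second point is only a difference in wording: UnicodeError is a subclass of ValueError, and both calls raise UnicodeDecodeError (itself a subclass of UnicodeError) on undecodable input. Another minor difference is in backwards compatibility: unicode is documented as being "New in version 2.0" whereas string.decode is "New in version 2.2".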
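The point about error types can be checked directly. The following is a minimal sketch, not taken from the documentation: the names text_type, INVALID and error_type are made up for illustration, and the fallback to str is an assumption that lets the snippet also run on Python 3, where unicode() no longer exists and str(bytes, encoding) plays the same role.

```python
# -*- coding: utf-8 -*-
# Assumption: on Python 3 there is no unicode() builtin, so str(bytes, encoding)
# stands in for it; on Python 2 the real unicode() is used.
try:
    text_type = unicode
except NameError:
    text_type = str

# Hypothetical sample input: these bytes are not valid UTF-8.
INVALID = b'\xff\xfe\xfa'


def error_type(decode_call):
    """Return the type of the exception raised by decode_call(), or None."""
    try:
        decode_call()
    except ValueError as exc:  # also catches UnicodeError and its subclasses
        return type(exc)
    return None


via_decode = error_type(lambda: INVALID.decode('utf-8'))
via_unicode = error_type(lambda: text_type(INVALID, 'utf-8'))

# Do both calls fail with the same exception type?
print(via_decode is via_unicode)
# How the documented exception types relate to each other.
print(issubclass(UnicodeDecodeError, UnicodeError))
print(issubclass(UnicodeError, ValueError))
```

Catching ValueError is enough to handle a failure from either call, since the exception actually raised sits below it in the hierarchy.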

Given the above, which method to use seems to be entirely a matter of taste.

jochen answered Sep 30 '22 15:09


decode:
Decodes the string using the codec registered for encoding. encoding defaults to the default string encoding.
http://docs.python.org/2/library/stdtypes.html?#str.decode

unicode:
Return the Unicode string version of object [...].
See: http://docs.python.org/2/library/functions.html#unicode

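Since both functions use the codec registered for the encoding you pass, they return the same result, and this holds for any encoding, not only UTF-8. They differ only when no encoding is given: unicode(obj) also accepts non-string objects and returns their Unicode representation, whereas decode exists only on strings.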
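A quick way to see how the two calls behave with encodings other than UTF-8 is to compare them directly. This is only a sketch under stated assumptions: the helper same_result, the sample bytes and the chosen encodings are illustrative, and the str fallback is there so the snippet also runs on Python 3, where str(bytes, encoding) replaces unicode().

```python
# -*- coding: utf-8 -*-
# Assumption: fall back to str on Python 3, where unicode() does not exist.
try:
    text_type = unicode
except NameError:
    text_type = str

# Hypothetical sample: five 0xE1 bytes, which map to a character in each of
# the single-byte encodings listed below.
RAW = b'\xe1\xe1\xe1\xe1\xe1'
ENCODINGS = ('latin-1', 'cp1252', 'iso-8859-15')


def same_result(raw, encoding):
    """Compare raw.decode(encoding) with unicode(raw, encoding)."""
    return raw.decode(encoding) == text_type(raw, encoding)


results = dict((enc, same_result(RAW, enc)) for enc in ENCODINGS)
for enc in ENCODINGS:
    print('%-12s %s' % (enc, results[enc]))
```

For each encoding both calls go through the same codec, so the comparison comes out equal. The output changes only when the two calls are given different encodings.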

Tim Zimmermann answered Sep 30 '22 17:09