
unicode() vs. str.decode() for a UTF-8 encoded byte string (Python 2.x)

Is there any reason to prefer unicode(somestring, 'utf8') as opposed to somestring.decode('utf8')?

My only thought is that .decode() is a bound method, so Python may be able to resolve it more efficiently, but correct me if I'm wrong.

ʞɔıu asked Jan 13 '09 19:01




2 Answers

I'd prefer 'something'.decode(...), since the unicode type no longer exists in Python 3.0, while text = b'binarydata'.decode(encoding) is still valid.
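
As a minimal sketch of that point, assuming the bytes below are the UTF-8 encoding of three "é" characters (the name text_type is a hypothetical alias, not something from the question): the method form is spelled the same way on both major versions, while the constructor form needs a different name on each.

```python
# UTF-8 bytes for three e-acute characters, written with escapes
# so the source file encoding does not matter.
raw = b'\xc3\xa9\xc3\xa9\xc3\xa9'

# The method form is spelled identically on Python 2.6+ and Python 3.
text = raw.decode('utf-8')

# The constructor form depends on the version: unicode() on 2.x, str() on 3.x.
# text_type is a hypothetical alias introduced only for this illustration.
try:
    text_type = unicode  # Python 2
except NameError:
    text_type = str      # Python 3

same_text = text_type(raw, 'utf-8')
```

Both calls should produce the same three-character text object; only the second needs the version check.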

dF. answered Oct 11 '22 12:10



It's easy to benchmark it:

>>> from timeit import Timer
>>> ts = Timer("s.decode('utf-8')", "s = 'ééé'")
>>> ts.timeit()
8.9185450077056885
>>> tu = Timer("unicode(s, 'utf-8')", "s = 'ééé'") 
>>> tu.timeit()
2.7656929492950439

In this measurement, unicode() is roughly three times faster.

FWIW, I don't know where you get the impression that methods would be faster - it's quite the contrary.
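
For anyone who wants to repeat the measurement, here is a hedged sketch of the same comparison that should also run on Python 3, where str(s, 'utf-8') stands in for unicode(s, 'utf-8'). The helper best_time and the constants are hypothetical names chosen for this example, and the figures will differ by interpreter version and machine, so no expected output is shown.

```python
from timeit import Timer

# Setup statement: UTF-8 bytes for three e-acute characters. The raw string keeps
# the backslash escapes intact until timeit compiles the statement.
SETUP = r"s = b'\xc3\xa9\xc3\xa9\xc3\xa9'"

# Pick the constructor spelling that exists on the running interpreter.
try:
    unicode  # Python 2: the builtin exists
    CONSTRUCTOR = "unicode(s, 'utf-8')"
except NameError:
    CONSTRUCTOR = "str(s, 'utf-8')"  # Python 3: str takes over the role of unicode


def best_time(stmt, number=100000, repeat=3):
    """Return the fastest of `repeat` runs, each executing `stmt` `number` times."""
    return min(Timer(stmt, SETUP).repeat(repeat=repeat, number=number))


method_time = best_time("s.decode('utf-8')")
constructor_time = best_time(CONSTRUCTOR)

print("s.decode('utf-8'): %.4f s" % method_time)
print("%s: %.4f s" % (CONSTRUCTOR, constructor_time))
```

Taking the minimum of several repeats reduces noise from other processes; it does not change which call is being timed.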

bruno desthuilliers answered Oct 11 '22 12:10
