When profiling our code I was surprised to find millions of calls to
C:\Python26\lib\encodings\utf_8.py:15(decode)
I started debugging and found that across our code base there are many small bugs, usually comparing a string to a unicode or adding a sting and a unicode. Python graciously decodes the strings and performs the following operations in unicode.
How kind. But expensive!
I am fluent in unicode, having read Joel Spolsky and Dive Into Python...
I try to keep our code internals in unicode only.
My question - can I turn off this pythonic nice-guy behavior? At least until I find all these bugs and fix them (usually by adding a u'u')?
Some of them are extremely hard to find (a variable that is sometimes a string...).
Python 2.6.5 (and I can't switch to 3.x).
Python supports the string type and the unicode type. A string is a sequence of chars while a unicode is a sequence of "pointers". The unicode is an in-memory representation of the sequence and every symbol on it is not a char but a number (in hex format) intended to select a char in a map.
Python's string type uses the Unicode Standard for representing characters, which lets Python programs work with all these different possible characters. Unicode (https://www.unicode.org/) is a specification that aims to list every character used by human languages and give each character its own unique code.
In python, text could be presented using unicode string or bytes. Unicode is a standard for encoding character. Unicode string is a python data structure that can store zero or more unicode characters.
The following should work:
>>> import sys
>>> reload(sys)
<module 'sys' (built-in)>
>>> sys.setdefaultencoding('undefined')
>>> u"abc" + u"xyz"
u'abcxyz'
>>> u"abc" + "xyz"
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/encodings/undefined.py", line 22, in decode
raise UnicodeError("undefined encoding")
UnicodeError: undefined encoding
reload(sys)
in the snippet above is only necessary here since normally sys.setdefaultencoding
is supposed to go in a sitecustomize.py
file in your Python site-packages
directory (it's advisable to do that).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With