
Can I turn off implicit Python unicode conversions to find my mixed-strings bugs?

When profiling our code I was surprised to find millions of calls to
C:\Python26\lib\encodings\utf_8.py:15(decode)

I started debugging and found that across our code base there are many small bugs, usually comparing a string to a unicode or adding a string and a unicode. Python graciously decodes the strings and performs the following operations in unicode.

How kind. But expensive!

I am fluent in unicode, having read Joel Spolsky and Dive Into Python...

I try to keep our code internals in unicode only.

My question: can I turn off this pythonic nice-guy behavior? At least until I find all these bugs and fix them (usually by adding a u prefix to the string literal)?

Some of them are extremely hard to find (a variable that is sometimes a string...).

Python 2.6.5 (and I can't switch to 3.x).

Tal Weiss asked May 17 '10 18:05




1 Answer

The following should work:

>>> import sys
>>> reload(sys)
<module 'sys' (built-in)>
>>> sys.setdefaultencoding('undefined')
>>> u"abc" + u"xyz"
u'abcxyz'
>>> u"abc" + "xyz"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/encodings/undefined.py", line 22, in decode
    raise UnicodeError("undefined encoding")
UnicodeError: undefined encoding

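The reload(sys) call in the snippet above is only needed in an interactive session: site.py removes sys.setdefaultencoding from the sys module at startup, and reloading sys brings it back. Normally the sys.setdefaultencoding call belongs in a sitecustomize.py file in your Python site-packages directory, and it's advisable to put it there.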
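
A sitecustomize.py along those lines might look like the sketch below. The function name disable_implicit_unicode_conversion is made up for illustration, and the getattr guard is an assumption added so that the file does nothing harmful when sys.setdefaultencoding is not available (for example on Python 3, where str and bytes never mix implicitly).

```python
# sitecustomize.py (a sketch; put it in a directory on sys.path,
# such as site-packages, so that site.py imports it at startup)
import sys


def disable_implicit_unicode_conversion():
    """Switch the default encoding to 'undefined' where that is possible.

    With the 'undefined' codec, any implicit str <-> unicode conversion
    raises UnicodeError instead of silently decoding.

    Returns True if the default encoding was changed, and False if
    sys.setdefaultencoding does not exist in this interpreter.
    (Hypothetical helper name, not part of any library.)
    """
    setter = getattr(sys, "setdefaultencoding", None)
    if setter is None:
        # Already removed by site.py, or running on Python 3.
        return False
    setter("undefined")
    return True


disable_implicit_unicode_conversion()
```

With this in place no reload(sys) is needed, because site.py imports sitecustomize before it removes sys.setdefaultencoding. Deleting the file restores the normal behavior once the mixed-string bugs are fixed.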

ChristopheD answered Sep 21 '22 15:09