I have a large project where at various places problematic implicit Unicode conversions (coercions) were used, in the form of e.g.:
someDynamicStr = "bar" # could come from various sources
# works
u"foo" + someDynamicStr
u"foo{}".format(someDynamicStr)
someDynamicStr = "\xff" # uh-oh
# raises UnicodeDecodeError
u"foo" + someDynamicStr
u"foo{}".format(someDynamicStr)
(Possibly other forms as well.)
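For context, these coercions fail because CPython 2 pushes the byte string through the default codec (sys.getdefaultencoding(), normally 'ascii') before concatenating or formatting; a minimal illustration of the failure:
import sys
print(sys.getdefaultencoding())  # normally 'ascii'
try:
    u"foo" + "\xff"
except UnicodeDecodeError as e:
    # 'ascii' codec can't decode byte 0xff in position 0: ordinal not in range(128)
    print(e)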
Now I would like to track down those usages, especially those in actively used code.
It would be great if I could easily replace the unicode constructor with a wrapper which checks whether the input is of type str and the encoding/errors parameters are set to the default values, and then notifies me (prints a traceback or such).
/edit:
While not directly related to what I am looking for, I came across this gloriously horrible hack for how to make the decode exception go away altogether (the decode one only, i.e. str to unicode, but not the other way around, see https://mail.python.org/pipermail/python-list/2012-July/627506.html).
I don't plan on using it but it might be interesting for those battling problems with invalid Unicode input and looking for a quick fix (but please think about the side effects):
import codecs
codecs.register_error("strict", codecs.ignore_errors)
codecs.register_error("strict", lambda x: (u"", x.end)) # alternatively
(An internet search for codecs.register_error("strict" revealed that apparently it's used in some real projects.)
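To illustrate why it is a quick fix and nothing more: once such a handler is registered under "strict", the failing example from above stops raising and the offending bytes are silently dropped:
import codecs
codecs.register_error("strict", lambda x: (u"", x.end))

someDynamicStr = "\xff"
print(repr(u"foo" + someDynamicStr))  # u'foo' -- the bad byte vanished without a trace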
/edit #2:
For explicit conversions I made a snippet with the help of a SO post on monkeypatching:
class PatchedUnicode(unicode):
    def __init__(self, obj=None, encoding=None, *args, **kwargs):
        if encoding in (None, "ascii", "646", "us-ascii"):
            print("Problematic unicode() usage detected!")
        super(PatchedUnicode, self).__init__(obj, encoding, *args, **kwargs)

import __builtin__
__builtin__.unicode = PatchedUnicode
This only affects explicit conversions using the unicode() constructor directly, so it's not something I need.
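As a usage note, the same monkeypatching idea can at least print the call site for explicit conversions; a sketch (Python 2 only, an untested variant of the snippet above):
import traceback
import __builtin__

class PatchedUnicode(unicode):
    def __init__(self, obj=None, encoding=None, *args, **kwargs):
        # report str arguments converted with the default (ASCII) encoding
        if isinstance(obj, str) and encoding in (None, "ascii", "646", "us-ascii"):
            print("Problematic unicode() usage detected!")
            traceback.print_stack()  # shows where the call came from
        super(PatchedUnicode, self).__init__(obj, encoding, *args, **kwargs)

__builtin__.unicode = PatchedUnicode

unicode("bar")   # prints the warning plus a stack trace
unicode(u"bar")  # not reported, already unicode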
/edit #3:
The thread "Extension method for python built-in types!" makes me think that it might actually not be easily possible (in CPython at least).
/edit #4:
It's nice to see many good answers here; too bad I can only give out the bounty once.
In the meantime I came across a somewhat similar question, at least in the sense of what the person tried to achieve: Can I turn off implicit Python unicode conversions to find my mixed-strings bugs? Please note though that throwing an exception would not have been OK in my case. Here I was looking for something which might point me to the different locations of problematic code (e.g. by printing something), but not something which might exit the program or change its behavior (because this way I can prioritize what to fix).
On another note, the people working on the Mypy project (who include Guido van Rossum) might also come up with something similarly helpful in the future; see the discussions at https://github.com/python/mypy/issues/1141 and more recently https://github.com/python/typing/issues/208.
/edit #5:
I also came across the following but haven't yet had the time to test it: https://pypi.python.org/pypi/unicode-nazi
You can register a custom encoding which prints a message whenever it's used:
Code in ourencoding.py:
import sys
import codecs
import traceback

# Define a function to print out a stack frame and a message:
def printWarning(s):
    sys.stderr.write(s)
    sys.stderr.write("\n")
    l = traceback.extract_stack()
    # cut off the frames pointing to printWarning and our_encode
    l = traceback.format_list(l[:-2])
    sys.stderr.write("".join(l))

# Define our encoding:
originalencoding = sys.getdefaultencoding()

def our_encode(s, errors='strict'):
    printWarning("Default encoding used")
    return (codecs.encode(s, originalencoding, errors), len(s))

def our_decode(s, errors='strict'):
    printWarning("Default encoding used")
    return (codecs.decode(s, originalencoding, errors), len(s))

def our_search(name):
    if name == 'our_encoding':
        return codecs.CodecInfo(
            name='our_encoding',
            encode=our_encode,
            decode=our_decode)
    return None

# register our search and set the default encoding:
codecs.register(our_search)
reload(sys)
sys.setdefaultencoding('our_encoding')
If you import this file at the start of your script, then you'll see warnings for implicit conversions:
#!python2
# coding: utf-8
import ourencoding
print("test 1")
a = "hello " + u"world"
print("test 2")
a = "hello ☺ " + u"world"
print("test 3")
b = u" ".join(["hello", u"☺"])
print("test 4")
c = unicode("hello ☺")
output:
test 1
test 2
Default encoding used
  File "test.py", line 10, in <module>
    a = "hello ☺ " + u"world"
test 3
Default encoding used
  File "test.py", line 13, in <module>
    b = u" ".join(["hello", u"☺"])
test 4
Default encoding used
  File "test.py", line 16, in <module>
    c = unicode("hello ☺")
It's not perfect, as test 1 shows: if the converted string only contains ASCII characters, sometimes you won't see a warning.
What you can do is the following:
First create a custom encoding. I will call it "lascii" for "logging ASCII":
import codecs
import traceback

def lascii_encode(input, errors='strict'):
    print("ENCODED:")
    traceback.print_stack()
    return codecs.ascii_encode(input, errors)

def lascii_decode(input, errors='strict'):
    print("DECODED:")
    traceback.print_stack()
    return codecs.ascii_decode(input, errors)

class Codec(codecs.Codec):
    def encode(self, input, errors='strict'):
        return lascii_encode(input, errors)
    def decode(self, input, errors='strict'):
        return lascii_decode(input, errors)

class IncrementalEncoder(codecs.IncrementalEncoder):
    def encode(self, input, final=False):
        print("Incremental ENCODED:")
        traceback.print_stack()
        # incremental encoders return the data only, not a (data, length) tuple
        return codecs.ascii_encode(input, self.errors)[0]

class IncrementalDecoder(codecs.IncrementalDecoder):
    def decode(self, input, final=False):
        print("Incremental DECODED:")
        traceback.print_stack()
        return codecs.ascii_decode(input, self.errors)[0]

class StreamWriter(Codec, codecs.StreamWriter):
    pass

class StreamReader(Codec, codecs.StreamReader):
    pass

def getregentry():
    return codecs.CodecInfo(
        name='lascii',
        encode=lascii_encode,
        decode=lascii_decode,
        incrementalencoder=IncrementalEncoder,
        incrementaldecoder=IncrementalDecoder,
        streamwriter=StreamWriter,
        streamreader=StreamReader,
    )
What this does is basically the same as the ASCII codec, except that it prints a message and the current stack trace every time it encodes or decodes.
Now you need to make it available to the codecs module so that it can be found by the name "lascii". For this you need to create a search function that returns the lascii-codec when it's fed with the string "lascii". This is then registered to the codecs module:
def searchFunc(name):
    if name == "lascii":
        return getregentry()
    else:
        return None

codecs.register(searchFunc)
The last thing left to do is to tell the sys module to use 'lascii' as the default encoding:
import sys
reload(sys) # necessary, because sys.setdefaultencoding is deleted on start of Python
sys.setdefaultencoding('lascii')
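As a rough, untested sketch: assuming everything above is saved in one module (hypothetically called lascii_setup here) and imported before anything else, an implicit coercion should then report itself much like in the other approach, although pure-ASCII concatenations may again slip through:
import lascii_setup  # hypothetical module containing the codec, registration and setdefaultencoding code above

b = u" ".join(["hello", u"world"])  # implicit str -> unicode decode of "hello"
# expected output, roughly:
#   DECODED:
#     File "test.py", line ..., in <module>
#       b = u" ".join(["hello", u"world"])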
Warning: This uses some deprecated or otherwise unrecommended features. It might not be efficient or bug-free. Do not use in production, only for testing and/or debugging.