
Tracking down implicit unicode conversions in Python 2

I have a large project where, in various places, problematic implicit Unicode conversions (coercions) were used, e.g.:

someDynamicStr = "bar" # could come from various sources

# works
u"foo" + someDynamicStr
u"foo{}".format(someDynamicStr)

someDynamicStr = "\xff" # uh-oh

# raises UnicodeDecodeError
u"foo" + someDynamicStr
u"foo{}".format(someDynamicStr)

(Possibly other forms as well.)
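What actually fails here is the implicit ASCII decode that Python 2 performs on the byte string before concatenating or formatting. The equivalent explicit operation fails the same way; a minimal sketch (written with a bytes literal so it behaves identically on Python 2 and 3):

```python
# In Python 2, u"foo" + someDynamicStr is roughly equivalent to
# u"foo" + someDynamicStr.decode(sys.getdefaultencoding()),
# and the default encoding is "ascii", so any byte >= 0x80 fails.
raw = b"\xff"  # a bytes literal; the same as "\xff" in Python 2

try:
    raw.decode("ascii")
except UnicodeDecodeError as e:
    print("decode failed: %s" % e.reason)
```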

Now I would like to track down those usages, especially those in actively used code.

It would be great if I could easily replace the unicode constructor with a wrapper which checks whether the input is of type str and the encoding/errors parameters are set to the default values, and which then notifies me (prints a traceback or such).

/edit:

While not directly related to what I am looking for, I came across this gloriously horrible hack for making the decode exceptions go away altogether (only the decode ones, i.e. str to unicode, not the other way around): https://mail.python.org/pipermail/python-list/2012-July/627506.html

I don't plan on using it but it might be interesting for those battling problems with invalid Unicode input and looking for a quick fix (but please think about the side effects):

import codecs
codecs.register_error("strict", codecs.ignore_errors)
codecs.register_error("strict", lambda x: (u"", x.end)) # alternatively

(An internet search for codecs.register_error("strict" reveals that it's apparently used in some real projects.)
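To illustrate the side effects: once a replacement handler is registered under the name "strict", every codec invoked with errors='strict' silently drops (or blanks out) the offending bytes instead of raising. A small sketch; the previous handler can be captured via codecs.lookup_error so the global change is reversible:

```python
import codecs

# Keep a reference to the real handler so the hack can be undone:
original_strict = codecs.lookup_error("strict")

# Globally replace the "strict" handler: bad bytes now decode to u"":
codecs.register_error("strict", lambda exc: (u"", exc.end))

print(b"foo\xffbar".decode("ascii"))  # the \xff byte is silently swallowed

# Undo the global side effect:
codecs.register_error("strict", original_strict)
```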

/edit #2:

For explicit conversions I made a snippet with the help of a SO post on monkeypatching:

class PatchedUnicode(unicode):
    def __init__(self, obj=None, encoding=None, *args, **kwargs):
        if encoding in (None, "ascii", "646", "us-ascii"):
            print("Problematic unicode() usage detected!")
        super(PatchedUnicode, self).__init__(obj, encoding, *args, **kwargs)

import __builtin__
__builtin__.unicode = PatchedUnicode

This only affects explicit conversions using the unicode() constructor directly, so it's not what I need.

/edit #3:

The thread "Extension method for python built-in types!" makes me think that it might actually not be easily possible (in CPython at least).

/edit #4:

It's nice to see many good answers here, too bad I can only give out the bounty once.

In the meantime I came across a somewhat similar question, at least in terms of what the author tried to achieve: Can I turn off implicit Python unicode conversions to find my mixed-strings bugs? Note though that throwing an exception would not have been OK in my case. I was looking for something which points me to the various locations of problematic code (e.g. by printing something), but not something which exits the program or changes its behavior (that way I can prioritize what to fix).

On another note, the people working on the Mypy project (who include Guido van Rossum) might also come up with something similarly helpful in the future; see the discussions at https://github.com/python/mypy/issues/1141 and, more recently, https://github.com/python/typing/issues/208.

/edit #5

I also came across the following but haven't yet had time to test it: https://pypi.python.org/pypi/unicode-nazi

Asked Sep 23 '16 by phk


2 Answers

You can register a custom encoding which prints a message whenever it's used:

Code in ourencoding.py:

import sys
import codecs
import traceback

# Define a function to print out a stack frame and a message:

def printWarning(s):
    sys.stderr.write(s)
    sys.stderr.write("\n")
    l = traceback.extract_stack()
    # cut off the frames pointing to printWarning and our_encode
    l = traceback.format_list(l[:-2])
    sys.stderr.write("".join(l))

# Define our encoding:

originalencoding = sys.getdefaultencoding()

def our_encode(s, errors='strict'):
    printWarning("Default encoding used")
    return (codecs.encode(s, originalencoding, errors), len(s))

def our_decode(s, errors='strict'):
    printWarning("Default encoding used")
    return (codecs.decode(s, originalencoding, errors), len(s))

def our_search(name):
    if name == 'our_encoding':
        return codecs.CodecInfo(
            name='our_encoding',
            encode=our_encode,
            decode=our_decode)
    return None

# register our search and set the default encoding:
codecs.register(our_search)
reload(sys)  # needed because site.py deletes sys.setdefaultencoding at startup
sys.setdefaultencoding('our_encoding')

If you import this file at the start of your script, you'll see warnings for implicit conversions:

#!python2
# coding: utf-8

import ourencoding

print("test 1")
a = "hello " + u"world"

print("test 2")
a = "hello ☺ " + u"world"

print("test 3")
b = u" ".join(["hello", u"☺"])

print("test 4")
c = unicode("hello ☺")

output:

test 1
test 2
Default encoding used
 File "test.py", line 10, in <module>
   a = "hello ☺ " + u"world"
test 3
Default encoding used
 File "test.py", line 13, in <module>
   b = u" ".join(["hello", u"☺"])
test 4
Default encoding used
 File "test.py", line 16, in <module>
   c = unicode("hello ☺")

It's not perfect: as test 1 shows, you sometimes won't see a warning if the converted string contains only ASCII characters.

Answered Nov 15 '22 by roeland


What you can do is the following:

First create a custom encoding. I will call it "lascii" for "logging ASCII":

import codecs
import traceback

def lascii_encode(input, errors='strict'):
    print("ENCODED:")
    traceback.print_stack()
    return codecs.ascii_encode(input, errors)


def lascii_decode(input, errors='strict'):
    print("DECODED:")
    traceback.print_stack()
    return codecs.ascii_decode(input, errors)

class Codec(codecs.Codec):
    def encode(self, input,errors='strict'):
        return lascii_encode(input,errors)
    def decode(self, input,errors='strict'):
        return lascii_decode(input,errors)

class IncrementalEncoder(codecs.IncrementalEncoder):
    def encode(self, input, final=False):
        print("Incremental ENCODED:")
        traceback.print_stack()
        # incremental encoders return only the data, not the (data, length) tuple
        return codecs.ascii_encode(input)[0]

class IncrementalDecoder(codecs.IncrementalDecoder):
    def decode(self, input, final=False):
        print("Incremental DECODED:")
        traceback.print_stack()
        return codecs.ascii_decode(input)[0]

class StreamWriter(Codec,codecs.StreamWriter):
    pass

class StreamReader(Codec,codecs.StreamReader):
    pass

def getregentry():
    return codecs.CodecInfo(
        name='lascii',
        encode=lascii_encode,
        decode=lascii_decode,
        incrementalencoder=IncrementalEncoder,
        incrementaldecoder=IncrementalDecoder,
        streamwriter=StreamWriter,
        streamreader=StreamReader,
    )

What this does is basically the same as the ASCII codec, except that it prints a message and the current stack trace every time it encodes or decodes.

Now you need to make it available to the codecs module so that it can be found under the name "lascii". For this you create a search function that returns the lascii codec when fed the string "lascii", and register it with the codecs module:

def searchFunc(name):
    if name=="lascii":
        return getregentry()
    else:
        return None

codecs.register(searchFunc)

The last thing that is now left to do is to tell the sys module to use 'lascii' as default encoding:

import sys
reload(sys)  # necessary, because site.py deletes sys.setdefaultencoding at startup
sys.setdefaultencoding('lascii')

Warning: This uses some deprecated or otherwise unrecommended features. It might not be efficient or bug-free. Do not use in production, only for testing and/or debugging.
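For a quick check that the codec above is wired up correctly, before (or instead of) touching the default encoding, a condensed, self-contained variant can be exercised explicitly by name. This sketch also runs on Python 3, where sys.setdefaultencoding no longer exists:

```python
import codecs
import sys
import traceback

def lascii_encode(input, errors='strict'):
    # Log the call site, then delegate to the plain ASCII codec.
    sys.stderr.write("ENCODED:\n")
    traceback.print_stack(file=sys.stderr)
    return codecs.ascii_encode(input, errors)

def lascii_decode(input, errors='strict'):
    sys.stderr.write("DECODED:\n")
    traceback.print_stack(file=sys.stderr)
    return codecs.ascii_decode(input, errors)

def search_func(name):
    if name == "lascii":
        return codecs.CodecInfo(name="lascii",
                                encode=lascii_encode,
                                decode=lascii_decode)
    return None

codecs.register(search_func)

# Explicit use; each call logs a stack trace to stderr:
print(u"hello".encode("lascii"))
print(b"hello".decode("lascii"))
```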

Answered Nov 15 '22 by Dakkaron