I know how function comparison works in Python 3 (it just compares addresses in memory), and I understand why.
I also understand that "true" comparison (do functions f and g return the same result given the same arguments, for any arguments?) is practically impossible.
I am looking for something in between. I want the comparison to work on the simplest cases of identical functions, and possibly some less trivial ones:
lambda x : x == lambda x : x # True
lambda x : 2 * x == lambda y : 2 * y # True
lambda x : 2 * x == lambda x : x * 2 # True or False is fine, but must be stable
lambda x : 2 * x == lambda x : x + x # True or False is fine, but must be stable
Note that I'm interested in solving this problem for anonymous functions (lambda), but wouldn't mind if the solution also works for named functions.
The motivation for this is that inside the blist module, it would be nice to verify that two sortedset instances have the same sort function before performing a union, etc. on them.
Named functions are of less interest because I can assume them to be different when they are not identical. After all, suppose someone created two sortedsets with a named function in the key argument. If they intend these instances to be "compatible" for the purposes of set operations, they'd probably use the same function, rather than two separate named functions that perform identical operations.
I can only think of three approaches. All of them seem hard, so any ideas are appreciated.
1. Comparing bytecode might work, but it is implementation dependent, so code that works on one Python could break on another.
2. Comparing tokenized source code seems reasonable and portable. Of course, it's less powerful (since identical functions are more likely to be rejected).
3. A solid heuristic borrowed from a symbolic computation textbook is theoretically the best approach. It might seem too heavy for my purpose, but it could actually be a good fit, since lambda functions are usually tiny and so it would run fast.
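To make the second approach concrete, here is a minimal sketch that compares lambdas at the source level using the ast module instead of raw tokens. It is only an illustration under stated assumptions: it takes the lambda source as a string (recovering source from a live lambda object is unreliable), it assumes Python 3.8+, and the names `normalized` and `same_source` are made up for this example. It renames parameters positionally, so `x` versus `y` does not matter, but it does no algebra, so `2 * x` and `x * 2` stay different (stably).

```python
import ast


class _RenameParams(ast.NodeTransformer):
    """Rename a lambda's parameters to positional placeholders (_0, _1, ...).

    Limitation of this sketch: *args/**kwargs and nested lambdas that
    shadow an outer parameter are not handled.
    """

    def visit_Lambda(self, node):
        params = node.args.posonlyargs + node.args.args + node.args.kwonlyargs
        mapping = {a.arg: "_%d" % i for i, a in enumerate(params)}
        for a in params:
            a.arg = mapping[a.arg]
        # Rewrite every use of a parameter inside the body.
        for sub in ast.walk(node.body):
            if isinstance(sub, ast.Name) and sub.id in mapping:
                sub.id = mapping[sub.id]
        return node


def normalized(source):
    """Return a canonical dump of a lambda expression given as a string."""
    tree = ast.parse(source.strip(), mode="eval")
    tree = _RenameParams().visit(tree)
    # ast.dump omits line/column attributes by default, so layout is ignored.
    return ast.dump(tree)


def same_source(src1, src2):
    """True if the two lambda sources are identical up to parameter names."""
    return normalized(src1) == normalized(src2)


print(same_source("lambda x : x", "lambda x : x"))
print(same_source("lambda x : 2 * x", "lambda y : 2 * y"))
print(same_source("lambda x : 2 * x", "lambda x : x * 2"))
```

Because this works on syntax trees rather than bytecode, the result should not depend on the interpreter's code generator, which is the portability advantage mentioned above. Free variables such as `fields` are compared by name only, not by value.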
EDIT
A more complicated example, based on the comment by @delnan:
# global variable
fields = ['id', 'name']

def my_function():
    global fields
    s1 = sortedset(key = lambda x : x[fields[0].lower()])
    # some intervening code here
    # ...
    s2 = sortedset(key = lambda x : x[fields[0].lower()])
Would I expect the key functions for s1 and s2 to evaluate as equal?
If the intervening code contains any function call at all, the value of fields may be modified, resulting in different key functions for s1 and s2. Since we clearly won't be doing control flow analysis to solve this problem, we have to treat these two lambda functions as different if we are trying to perform this evaluation before runtime. (Even if fields weren't global, it might have had another name bound to it, etc.) This would severely curtail the usefulness of this whole exercise, since few lambda functions have no dependence on the environment.
EDIT 2:
I realized it's very important to compare the function objects as they exist at runtime. Without that, functions that depend on variables from an outer scope cannot be compared, and most useful functions do have such dependencies. Considered at runtime, all functions with the same signature are comparable in a clean, logical way, regardless of what they depend on, whether they are impure, etc.
As a result, I need not just the bytecode but also the global state as of the time the function object was created (presumably __globals__). Then I have to match all variables from the outer scope to the values from __globals__.
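A rough sketch of what that runtime comparison might look like in Python 3, assuming the standard inspect.getclosurevars (available since 3.3) is an acceptable way to resolve outer names. The helper name `runtime_key` and the `make_keys` factory are hypothetical, introduced only for this illustration. The idea is to pair the raw bytecode with the values that global and closure names currently resolve to.

```python
import inspect


def runtime_key(func):
    """Snapshot of a function: its bytecode plus resolved outer-scope values.

    Hypothetical helper. Two functions with equal keys run the same
    bytecode against equal outer values at the moment of the snapshot.
    """
    code = func.__code__
    outer = inspect.getclosurevars(func)
    return (
        code.co_code,           # raw bytecode (locals are referenced by slot)
        code.co_consts,         # literal constants
        code.co_names,          # global and attribute names used
        dict(outer.nonlocals),  # closure variables -> current values
        dict(outer.globals),    # referenced globals -> current values
    )


def make_keys(fields):
    # Two separately written lambdas that close over the same variable.
    k1 = lambda x: x[fields[0].lower()]
    k2 = lambda y: y[fields[0].lower()]
    return k1, k2


k1, k2 = make_keys(['id', 'name'])
k3, _ = make_keys(['name', 'id'])

print(runtime_key(k1) == runtime_key(k2))  # same code, same outer values
print(runtime_key(k1) == runtime_key(k3))  # same code, different outer values
```

Two caveats with this sketch: the snapshot stores references, so a later in-place mutation of `fields` changes the stored key as well (copying the values would be needed to freeze them), and the bytecode comparison is only meaningful between functions compiled by the same interpreter.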
Edited to check whether external state will affect the sorting function, as well as whether the two functions are equivalent.
I hacked up dis.dis and friends to output to a global file-like object. I then stripped out line numbers and normalized variable names (without touching constants) and compared the result.
You could clean this up so that dis.dis and friends yielded out lines, so you wouldn't have to trap their output. But this is a working proof of concept for using dis.dis for function comparison with minimal changes.
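# Python 2 code, adapted from the standard library's dis module.
# All output is written to the global file-like object `out`.
import types
from opcode import *

_have_code = (types.MethodType, types.FunctionType, types.CodeType,
              types.ClassType, type)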
def dis(x):
    """Disassemble classes, methods, functions, or code.

    With no argument, disassemble the last traceback.
    """
    if isinstance(x, types.InstanceType):
        x = x.__class__
    if hasattr(x, 'im_func'):
        x = x.im_func
    if hasattr(x, 'func_code'):
        x = x.func_code
    if hasattr(x, '__dict__'):
        items = x.__dict__.items()
        items.sort()
        for name, x1 in items:
            if isinstance(x1, _have_code):
                print >> out, "Disassembly of %s:" % name
                try:
                    dis(x1)
                except TypeError, msg:
                    print >> out, "Sorry:", msg
                print >> out
    elif hasattr(x, 'co_code'):
        disassemble(x)
    elif isinstance(x, str):
        disassemble_string(x)
    else:
        raise TypeError, \
              "don't know how to disassemble %s objects" % \
              type(x).__name__

def disassemble(co, lasti=-1):
    """Disassemble a code object."""
    code = co.co_code
    labels = findlabels(code)
    linestarts = dict(findlinestarts(co))
    n = len(code)
    i = 0
    extended_arg = 0
    free = None
    while i < n:
        c = code[i]
        op = ord(c)
        if i in linestarts:
            if i > 0:
                print >> out
            print >> out, "%3d" % linestarts[i],
        else:
            print >> out, ' ',

        if i == lasti: print >> out, '-->',
        else: print >> out, ' ',
        if i in labels: print >> out, '>>',
        else: print >> out, ' ',
        print >> out, repr(i).rjust(4),
        print >> out, opname[op].ljust(20),
        i = i+1
        if op >= HAVE_ARGUMENT:
            oparg = ord(code[i]) + ord(code[i+1])*256 + extended_arg
            extended_arg = 0
            i = i+2
            if op == EXTENDED_ARG:
                extended_arg = oparg*65536L
            print >> out, repr(oparg).rjust(5),
            if op in hasconst:
                print >> out, '(' + repr(co.co_consts[oparg]) + ')',
            elif op in hasname:
                print >> out, '(' + co.co_names[oparg] + ')',
            elif op in hasjrel:
                print >> out, '(to ' + repr(i + oparg) + ')',
            elif op in haslocal:
                print >> out, '(' + co.co_varnames[oparg] + ')',
            elif op in hascompare:
                print >> out, '(' + cmp_op[oparg] + ')',
            elif op in hasfree:
                if free is None:
                    free = co.co_cellvars + co.co_freevars
                print >> out, '(' + free[oparg] + ')',
        print >> out

def disassemble_string(code, lasti=-1, varnames=None, names=None,
                       constants=None):
    labels = findlabels(code)
    n = len(code)
    i = 0
    while i < n:
        c = code[i]
        op = ord(c)
        if i == lasti: print >> out, '-->',
        else: print >> out, ' ',
        if i in labels: print >> out, '>>',
        else: print >> out, ' ',
        print >> out, repr(i).rjust(4),
        print >> out, opname[op].ljust(15),
        i = i+1
        if op >= HAVE_ARGUMENT:
            oparg = ord(code[i]) + ord(code[i+1])*256
            i = i+2
            print >> out, repr(oparg).rjust(5),
            if op in hasconst:
                if constants:
                    print >> out, '(' + repr(constants[oparg]) + ')',
                else:
                    print >> out, '(%d)'%oparg,
            elif op in hasname:
                if names is not None:
                    print >> out, '(' + names[oparg] + ')',
                else:
                    print >> out, '(%d)'%oparg,
            elif op in hasjrel:
                print >> out, '(to ' + repr(i + oparg) + ')',
            elif op in haslocal:
                if varnames:
                    print >> out, '(' + varnames[oparg] + ')',
                else:
                    print >> out, '(%d)' % oparg,
            elif op in hascompare:
                print >> out, '(' + cmp_op[oparg] + ')',
        print >> out

def findlabels(code):
    """Detect all offsets in a byte code which are jump targets.

    Return the list of offsets.
    """
    labels = []
    n = len(code)
    i = 0
    while i < n:
        c = code[i]
        op = ord(c)
        i = i+1
        if op >= HAVE_ARGUMENT:
            oparg = ord(code[i]) + ord(code[i+1])*256
            i = i+2
            label = -1
            if op in hasjrel:
                label = i+oparg
            elif op in hasjabs:
                label = oparg
            if label >= 0:
                if label not in labels:
                    labels.append(label)
    return labels

def findlinestarts(code):
    """Find the offsets in a byte code which are start of lines in the source.

    Generate pairs (offset, lineno) as described in Python/compile.c.
    """
    byte_increments = [ord(c) for c in code.co_lnotab[0::2]]
    line_increments = [ord(c) for c in code.co_lnotab[1::2]]

    lastlineno = None
    lineno = code.co_firstlineno
    addr = 0
    for byte_incr, line_incr in zip(byte_increments, line_increments):
        if byte_incr:
            if lineno != lastlineno:
                yield (addr, lineno)
                lastlineno = lineno
            addr += byte_incr
        lineno += line_incr
    if lineno != lastlineno:
        yield (addr, lineno)

class FakeFile(object):
    """Minimal file-like object that collects everything written to it."""
    def __init__(self):
        self.store = []
    def write(self, data):
        self.store.append(data)

a = lambda x : x
b = lambda x : x # True
c = lambda x : 2 * x
d = lambda y : 2 * y # True
e = lambda x : 2 * x
f = lambda x : x * 2 # True or False is fine, but must be stable
g = lambda x : 2 * x
h = lambda x : x + x # True or False is fine, but must be stable

funcs = a, b, c, d, e, f, g, h

# Disassemble each function into its own FakeFile via the global `out`.
outs = []
for func in funcs:
    out = FakeFile()
    dis(func)
    outs.append(out.store)
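
import ast

def outfilter(out):
    for i in out:
        # Drop purely numeric tokens (line numbers, offsets, raw opargs).
        if i.strip().isdigit():
            continue
        # Keep literal constants; replace any name with a placeholder.
        if '(' in i:
            try:
                ast.literal_eval(i)
            except ValueError:
                i = "(x)"
        yield i

# A function is "polluted" if it reads global state (LOAD_GLOBAL)
# or a closure variable (LOAD_DEREF).
processed_outs = [(out, 'LOAD_GLOBAL' in out or 'LOAD_DEREF' in out)
                  for out in (''.join(outfilter(out)) for out in outs)]

# Compare the functions pairwise: (a, b), (c, d), (e, f), (g, h).
for (out1, polluted1), (out2, polluted2) in zip(processed_outs[::2], processed_outs[1::2]):
    print 'Bytecode Equivalent:', out1 == out2, '\nPolluted by state:', polluted1 or polluted2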
The output is True, True, False, and False, and is stable. The "Polluted" bool is true if the output will depend on external state -- either global state or a closure.
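Since the question is about Python 3, here is a hedged sketch of the same idea using dis.get_instructions (Python 3.4+), which yields instruction objects directly, so there is no need to trap printed output. This is not a port of the code above, just an illustration: the names `fingerprint`, `polluted` and `STATEFUL_OPS` are invented for this example. It compares raw opargs (locals are referenced by slot number, so parameter names drop out automatically) together with the constants and names tables.

```python
import dis

# Opcodes that read state from outside the function. Note that LOAD_GLOBAL
# is also used for builtins such as len, so this errs on the side of caution.
STATEFUL_OPS = {"LOAD_GLOBAL", "LOAD_NAME", "LOAD_DEREF", "LOAD_CLASSDEREF"}


def fingerprint(func):
    """Return a comparable summary of a function's bytecode (hypothetical helper)."""
    code = func.__code__
    ops = tuple((ins.opname, ins.arg) for ins in dis.get_instructions(code))
    return ops, code.co_consts, code.co_names


def polluted(func):
    """True if the function's result may depend on global or closure state."""
    return any(ins.opname in STATEFUL_OPS
               for ins in dis.get_instructions(func.__code__))


pairs = [
    (lambda x: x, lambda x: x),
    (lambda x: 2 * x, lambda y: 2 * y),
    (lambda x: 2 * x, lambda x: x * 2),
    (lambda x: 2 * x, lambda x: x + x),
]
for f1, f2 in pairs:
    print("Bytecode Equivalent:", fingerprint(f1) == fingerprint(f2),
          "\nPolluted by state:", polluted(f1) or polluted(f2))
```

As with the original, the comparison is purely structural, so the last two pairs come out as not equivalent. The fingerprint is only comparable between functions compiled by the same interpreter version, which is the implementation dependence mentioned in the question.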