How do I transform a directed acyclic graph into a hash value such that any two isomorphic graphs hash to the same value? It is acceptable, but undesirable for two isomorphic graphs to hash to different values, which is what I have done in the code below. We can assume that the number of vertices in the graph is at most 11.
I am particularly interested in Python code.
Here is what I did. If self.lt
is a mapping from node to descendants (not children!), then I relabel the nodes according to a modified topological sort (that prefers to order elements with more descendants first if it can). Then, I hash the sorted dictionary. Some isomorphic graphs will hash to different values, especially as the number of nodes grows.
I have included all the code to motivate my use case. I am calculating the number of comparisons required to find the median of 7 numbers. The more that isomorphic graphs hash to the same value the less work that has to be redone. I considered putting larger connected components first, but didn't see how to do that quickly.
from tools.decorator import memoized # A standard memoization decorator
class Graph:
def __init__(self, n):
self.lt = {i: set() for i in range(n)}
def compared(self, i, j):
return j in self.lt[i] or i in self.lt[j]
def withedge(self, i, j):
retval = Graph(len(self.lt))
implied_lt = self.lt[j] | set([j])
for (s, lt_s), (k, lt_k) in zip(self.lt.items(),
retval.lt.items()):
lt_k |= lt_s
if i in lt_k or k == i:
lt_k |= implied_lt
return retval.toposort()
def toposort(self):
mapping = {}
while len(mapping) < len(self.lt):
for i, lt_i in self.lt.items():
if i in mapping:
continue
if any(i in lt_j or len(lt_i) < len(lt_j)
for j, lt_j in self.lt.items()
if j not in mapping):
continue
mapping[i] = len(mapping)
retval = Graph(0)
for i, lt_i in self.lt.items():
retval.lt[mapping[i]] = {mapping[j]
for j in lt_i}
return retval
def median_known(self):
n = len(self.lt)
for i, lt_i in self.lt.items():
if len(lt_i) != n // 2:
continue
if sum(1
for j, lt_j in self.lt.items()
if i in lt_j) == n // 2:
return True
return False
def __repr__(self):
return("[{}]".format(", ".join("{}: {{{}}}".format(
i,
", ".join(str(x) for x in lt_i))
for i, lt_i in self.lt.items())))
def hashkey(self):
return tuple(sorted({k: tuple(sorted(v))
for k, v in self.lt.items()}.items()))
def __hash__(self):
return hash(self.hashkey())
def __eq__(self, other):
return self.hashkey() == other.hashkey()
@memoized
def mincomps(g):
print("Calculating:", g)
if g.median_known():
return 0
nodes = g.lt.keys()
return 1 + min(max(mincomps(g.withedge(i, j)),
mincomps(g.withedge(j, i)))
for i in nodes
for j in nodes
if j > i and not g.compared(i, j))
g = Graph(7)
print(mincomps(g))
Acyclic – No task can create data that goes on to reference itself. That could cause an infinite loop, which could give rise to a problem or two. There are no cycles in DAGs. Graph – In mathematics, a graph is a finite set of nodes, with vertices connecting the nodes.
In a dag, there must be at least one vertex with no incoming edge, or a vertext with indegree 0.
Nodes. Each node represents some object or piece of data. In the case of a DVCS, each node represents one revision of the entire repository tree. Directed edges. A directed edge (or “arrow”) from one node to another represents some kind of relationship between those two nodes.
Since there aren't self-loops (DAG), then you have to subtract n from the triangle numbers sequence n(n+1)2−n, which gives you the maximum number of edges a DAG can have as n(n−1)2, again n = # of nodes.
To effectively test for graph isomorphism you will want to use nauty. Specifically for Python there is the wrapper pynauty, but I can't attest its quality (to compile it correctly I had to do some simple patching on its setup.py
). If this wrapper is doing everything correctly, then it simplifies nauty a lot for the uses you are interested and it is only a matter of hashing pynauty.certificate(somegraph)
-- which will be the same value for isomorphic graphs.
Some quick tests showed that pynauty
is giving the same certificate for every graph (with same amount of vertices). But that is only because of a minor issue in the wrapper when converting the graph to nauty's format. After fixing this, it works for me (I also used the graphs at http://funkybee.narod.ru/graphs.htm for comparison). Here is the short patch which also considers the modifications needed in setup.py
:
diff -ur pynauty-0.5-orig/setup.py pynauty-0.5/setup.py --- pynauty-0.5-orig/setup.py 2011-06-18 20:53:17.000000000 -0300 +++ pynauty-0.5/setup.py 2013-01-28 22:09:07.000000000 -0200 @@ -31,7 +31,9 @@ ext_pynauty = Extension( name = MODULE + '._pynauty', - sources = [ pynauty_dir + '/' + 'pynauty.c', ], + sources = [ pynauty_dir + '/' + 'pynauty.c', + os.path.join(nauty_dir, 'schreier.c'), + os.path.join(nauty_dir, 'naurng.c')], depends = [ pynauty_dir + '/' + 'pynauty.h', ], extra_compile_args = [ '-O4' ], extra_objects = [ nauty_dir + '/' + 'nauty.o', diff -ur pynauty-0.5-orig/src/pynauty.c pynauty-0.5/src/pynauty.c --- pynauty-0.5-orig/src/pynauty.c 2011-03-03 23:34:15.000000000 -0300 +++ pynauty-0.5/src/pynauty.c 2013-01-29 00:38:36.000000000 -0200 @@ -320,7 +320,7 @@ PyObject *adjlist; PyObject *p; - int i,j; + Py_ssize_t i, j; int adjlist_length; int x, y;
Graph isomorphism for directed acyclic graphs is still GI-complete. Therefore there is currently no known (worst case sub-exponential) solution to guarantee that two isomorphic directed acyclic graphs will yield the same hash. Only if the mapping between different graphs is known - for example if all vertices have unique labels - one could efficiently guarantee matching hashes.
Okay, let's brute force this for a small number of vertices. We have to find a representation of the graph that is independent of the ordering of the vertices in the input and therefore guarantees that isomorphic graphs yield the same representation. Further this representation must ensure that no two non-isomorphic graphs yield the same representation.
The simplest solution is to construct the adjacency matrix for all n! permutations of the vertices and just interpret the adjacency matrix as n2 bit integer. Then we can just pick the smallest or largest of this numbers as canonical representation. This number completely encodes the graph and therefore ensures that no two non-isomorphic graphs yield the same number - one could consider this function a perfect hash function. And because we choose the smallest or largest number encoding the graph under all possible permutations of the vertices we further ensure that isomorphic graphs yield the same representation.
How good or bad is this in the case of 11 vertices? Well, the representation will have 121 bits. We can reduce this by 11 bits because the diagonal representing loops will be all zeros in an acyclic graph and are left with 110 bits. This number could in theory be decreased further; not all 2110 remaining graphs are acyclic and for each graph there may be up to 11! - roughly 225 - isomorphic representations but in practice this might be quite hard to do. Does anybody know how to compute the number of distinct directed acyclic graphs with n vertices?
How long will it take to find this representation? Naively 11! or 39,916,800 iterations. This is not nothing and probably already impractical but I did not implement and test it. But we can probably speed this up a bit. If we interpret the adjacency matrix as integer by concatenating the rows from top to bottom left to right we want many ones (zeros) at the left of the first row to obtain a large (small) number. Therefore we pick as first vertex the one (or one of the vertices) with largest (smallest) degree (indegree or outdegree depending on the representation) and than vertices connected (not connected) to this vertex in subsequent positions to bring the ones (zeros) to the left.
There are likely more possibilities to prune the search space but I am not sure if there are enough to make this a practical solution. Maybe there are or maybe somebody else can at least build something upon this idea.
How good does the hash have to be? I assume that you do not want a full serialization of the graph. A hash rarely guarantees that there is no second (but different) element (graph) that evaluates to the same hash. If it is very important to you, that isomorphic graphs (in different representations) have the same hash, then only use values that are invariant under a change of representation. E.g.:
(indegree, outdegree) = (i,j)
for any tuple (i,j)
up to (max(indegree), max(outdegree))
(or limited for tuples up to some fixed value (m,n)
)All these informations can be gathered in O(#nodes) [assuming that the graph is stored properly]. Concatenate them and you have a hash. If you prefer you can use some well known hash algorithm like sha
on these concatenated informations. Without additional hashing it is a continuous hash (it allows to find similar graphs), with additional hashing it is uniform and fixed in size if the chosen hash algorithm has these properties.
As it is, it is already good enough to register any added or removed connection. It might miss connections that were changed though (a -> c
instead of a -> b
).
This approach is modular and can be extended as far as you like. Any additional property that is being included will reduce the number of collisions but increase the effort necessary to get the hash value. Some more ideas:
node->child->child
chain ( = second order outdegree) or respectively the number of nodes that lead to the given node in two steps.xor
of any kind of hash of all the node-contents. Due to the xor
the specific order in which the nodes where added to the hash does not matter.You requested "a unique hash value" and clearly I cannot offer you one. But I see the terms "hash" and "unique to every graph" as mutually exclusive (not entirely true of course) and decided to answer the "hash" part and not the "unique" part. A "unique hash" (perfect hash) basically needs to be a full serialization of the graph (because the amount of information stored in the hash has to reflect the total amount of information in the graph). If that is really what you want just define some unique order of nodes (eg. sorted by own outdegree, then indegree, then outdegree of children and so on until the order is unambiguous) and serialize the graph in any way (using the position in the formentioned ordering as index to the nodes).
Of course this is much more complex though.
Years ago, I created a simple and flexible algorithm for exactly this problem (finding duplicate structures in a database of chemical structures by hashing them).
I named it "Powerhash", and to create the algorithm it required two insights. The first is the power iteration graph algorithm, also used in PageRank. The second is the ability to replace power iteration's inside step function with anything that we want. I replaced it with a function that does the following on each step, and for each node:
On the first step, a node's hash is affected by its direct neighbors. On the second step, a node's hash is affected by the neighborhood 2-hops away from it. On the Nth step a node's hash will be affected by the neighborhood N-hops around it. So you only need to continue running the Powerhash for N = graph_radius steps. In the end, the graph center node's hash will have been affected by the whole graph.
To produce the final hash, sort the final step's node hashes and concatenate them together. After that, you can compare the final hashes to find if two graphs are isomorphic. If you have labels, then add them in the internal hashes that you calculate for each node (and at each step).
For more on this you can look at my post here:
https://plus.google.com/114866592715069940152/posts/fmBFhjhQcZF
The algorithm above was implemented inside the "madIS" functional relational database. You can find the source code of the algorithm here:
https://github.com/madgik/madis/blob/master/src/functions/aggregate/graph.py
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With