I am looking for a Python implementation of an algorithm that performs the following task:
Given two directed graphs, which may contain cycles, and their roots, produce a score for the two graphs' similarity.
(The way that Python's difflib does for two sequences.)
Hopefully, such an implementation exists. Otherwise, I'll try to implement an algorithm myself. In that case, which algorithm is preferable to implement (with regard to simplicity)?
How the algorithm works is of no importance to me, though its complexity is. An algorithm that works with a different data structure is also acceptable, as long as a graph such as I described can be represented with that data structure.
I'll emphasize: an implementation would be much nicer.
Edit:
It seems an isomorphism algorithm is not relevant. It was suggested that graph edit distance is more to the point, which narrows my search down to a solution that either computes graph edit distance or reduces a graph to a tree and then performs tree edit distance.
The nodes themselves consist of a few lines of assembly code each.
Edit distance/graph isomorphism: One approach to evaluating graph similarity is graph isomorphism. Two graphs are similar if they are isomorphic [17], if one is isomorphic to a subgraph of the other, or if they have isomorphic subgraphs.
It is important to note that SimRank is a general algorithm that determines only the similarity of structural context. SimRank applies to any domain where there are enough relevant relationships between objects to base at least some notion of similarity on relationships.
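As an illustration of SimRank (not from the original discussion): recent NetworkX versions (2.4 and later) ship an implementation, nx.simrank_similarity. A minimal sketch, assuming a NetworkX version that provides it; the example graph and node labels are made up:

import networkx as nx

# A small directed graph; edges represent "links to" relationships.
G = nx.DiGraph([("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")])

# Pairwise SimRank scores as a dict of dicts, with values in [0, 1].
sim = nx.simrank_similarity(G)

# Every node is maximally similar to itself.
print(sim["a"]["a"])  # 1.0

# b and c share the in-neighbor a, so they score well above zero.
print(sim["b"]["c"])

Note that SimRank only looks at the relationship structure, so node contents (e.g. the assembly code in each node) would have to be handled separately.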
Similarity functions are used to measure the 'distance' between two vectors, numbers, or pairs. It is a measure of how similar the two objects being measured are: the two objects are deemed similar if the distance between them is small, and vice versa.
Another method is to use what is called Eigenvector Similarity. Basically, you calculate the Laplacian eigenvalues for each graph's adjacency matrix. For each graph, find the smallest k such that the sum of the k largest eigenvalues constitutes at least 90% of the sum of all of the eigenvalues. If the values of k differ between the two graphs, use the smaller one. The similarity metric is then the sum of the squared differences between the k largest eigenvalues of the two graphs. This produces a similarity metric in the range [0, ∞), where values closer to zero indicate greater similarity.
For example, if using networkx:

import networkx as nx

def select_k(spectrum, minimum_energy=0.9):
    # Smallest k whose top-k eigenvalues carry at least
    # minimum_energy of the total spectral energy.
    running_total = 0.0
    total = sum(spectrum)
    if total == 0.0:
        return len(spectrum)
    for i in range(len(spectrum)):
        running_total += spectrum[i]
        if running_total / total >= minimum_energy:
            return i + 1
    return len(spectrum)

# laplacian_spectrum returns eigenvalues in ascending order;
# reverse so the largest come first, as described above.
laplacian1 = nx.laplacian_spectrum(graph1)[::-1]
laplacian2 = nx.laplacian_spectrum(graph2)[::-1]

k1 = select_k(laplacian1)
k2 = select_k(laplacian2)
k = min(k1, k2)

similarity = sum((laplacian1[:k] - laplacian2[:k])**2)
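For a self-contained run of the same idea (graph1 and graph2 above are placeholders): note that NetworkX's laplacian_spectrum is defined for undirected graphs, so this sketch uses undirected examples; for the directed case you would need a directed Laplacian (e.g. nx.directed_laplacian_matrix) or to work on the underlying undirected graph.

import networkx as nx

def select_k(spectrum, minimum_energy=0.9):
    # Smallest k whose top-k eigenvalues carry >= minimum_energy of the total.
    running_total = 0.0
    total = sum(spectrum)
    if total == 0.0:
        return len(spectrum)
    for i in range(len(spectrum)):
        running_total += spectrum[i]
        if running_total / total >= minimum_energy:
            return i + 1
    return len(spectrum)

def spectral_similarity(g1, g2):
    # Largest eigenvalues first, to match the 90%-energy rule.
    l1 = nx.laplacian_spectrum(g1)[::-1]
    l2 = nx.laplacian_spectrum(g2)[::-1]
    k = min(select_k(l1), select_k(l2))
    return sum((l1[:k] - l2[:k]) ** 2)

g1 = nx.path_graph(5)
g2 = nx.path_graph(5)
g3 = nx.complete_graph(5)

print(spectral_similarity(g1, g2))  # identical graphs -> 0.0
print(spectral_similarity(g1, g3))  # different structure -> larger value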
What we ended up doing is implementing an algorithm described in: "Heuristics for Chemical Compound Matching".
We're using NetworkX for representing the graph and for finding the max clique.
Edit:
Basically, you create a new graph, each node (v) representing a possible pairing of a node from graph A (a) to a node from graph B (b).
If in your application the two nodes (a, b) are either similar or not, you remove nodes (v) from the new graph which correspond to dissimilar pairings (a, b). You connect two nodes with an edge if they don't contradict each other. For example, the pairings (a, b) and (a, c) contradict each other (see the article for a formal definition). You then find a clique in the new graph which has the maximum number of nodes.
If in your application the two nodes' similarity is not binary, you give the new nodes weights within a range (say (0, 1)). You can heuristically remove new nodes with similarity grades lower than a predefined threshold. You then find a clique in the new graph which has the maximum weight (the sum of the nodes' assigned weights).
Either way, you finish by generating the similarity grade: the size/total weight of the clique divided by a function of the attributes of the original graphs (the maximum/minimum/average of the sizes/weights of A and B).
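The binary variant can be sketched with NetworkX's clique enumeration. This is a toy illustration, not the authors' code; the node_sim predicate and the "share no node" contradiction rule are assumptions standing in for the application-specific criteria:

import networkx as nx

def association_graph_similarity(a_nodes, b_nodes, node_sim):
    # Association graph: one node per candidate pairing (a, b)
    # that the application deems similar.
    assoc = nx.Graph()
    pairs = [(a, b) for a in a_nodes for b in b_nodes if node_sim(a, b)]
    assoc.add_nodes_from(pairs)

    # Edge between two pairings iff they don't contradict each other:
    # here, iff they share neither an A-node nor a B-node.
    for i, (a1, b1) in enumerate(pairs):
        for a2, b2 in pairs[i + 1:]:
            if a1 != a2 and b1 != b2:
                assoc.add_edge((a1, b1), (a2, b2))

    # Maximum clique = largest mutually consistent set of pairings.
    best = max(nx.find_cliques(assoc), key=len, default=[])

    # Normalize by a function of the original graphs' sizes, e.g. the max.
    return len(best) / max(len(a_nodes), len(b_nodes))

# Toy run: nodes are labeled by "type"; same first character means "similar".
sim = association_graph_similarity(
    ["x1", "y1"], ["x2", "y2", "z2"],
    node_sim=lambda a, b: a[0] == b[0],  # assumed similarity predicate
)
print(sim)  # 2 consistent pairings out of max(2, 3) original nodes

Exact maximum clique is NP-hard, so for large association graphs you would swap in an approximate clique finder; the weighted variant instead maximizes the sum of node weights over the cliques.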
A nice feature is that you can deduce the "source" of the similarity from the clique you found: the "stronger" pairings.
Further clarification: The constraints are application dependent. We used the approach to compare pairs of function control-flow graphs. Generally, the approach finds a matching of some nodes in the first graph to some nodes in the second graph (subgraph to subgraph). Each node in the association graph symbolizes a possible matching of a single node from the first graph to a single node in the second graph. Since eventually a clique (a subset of the nodes) is selected, an edge means two matchings don't contradict each other. To apply it to a different application, you should ask what the criteria for possible pairings are (i.e., what nodes do I create), and how selecting one pairing influences the selection of another (i.e., how do I connect the nodes with edges).