I have a large (>1000) set of directed acyclic graphs, each with a large (>1000) set of vertices. The vertices are labeled, and the label cardinality is small (<30).
I want to identify (mine) substructures that appear frequently over the whole set of graphs.
The output we are looking for is a list of substructures and their (number of) appearances in the given graphs.
I have tried to look into this and (as always seems to happen to me) the problem is NP-complete. As far as I can see, gSpan is the most common algorithm for this problem. However, as stated above, I'm not looking for arbitrary common substructures in the graphs but only for those that obey certain rules. One should be able to use that to reduce the search space.
Any insight on how to approach this problem?
Update: I should probably add that the aforementioned rules can be recursive up to a certain degree. For example "a vertex labeled A with at least two children labeled B, each having at least one child labeled A". The maximum recursion depth is somewhere between 1 and 10.
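For illustration, one way such a rule could be written down (purely an example encoding, not a format I'm committed to):

# "A vertex labeled A with at least two children labeled B,
#  each having at least one child labeled A"
# encoded as (label, [(min_count, child_rule), ...])
rule = ('A', [(2, ('B', [(1, ('A', []))]))])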
Update II: Pointing out that we are not searching for known or preferred substructures but mining them. There is no spoon needle.
I'm assuming in my answer that you want to minimize running time without spending an excessive amount of time writing the code. One thing I struggled with when first learning to write highly efficient algorithms is that multiple passes can sometimes be far more efficient than one. In this case, fundamentally, you want two passes:
First, create a filter that allows you to ignore most (if not all) non-duplicated patterns. In order to do this:
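To make the first pass concrete, here is a minimal sketch of one way it could work, assuming a Bloom filter over canonical substructure signatures; enumerate_signatures is a placeholder you would have to supply for your own rules, and the sizing constants are illustrative only.

import hashlib

BLOOM_BITS = 1 << 24              # size this to the cache level you are targeting
NUM_HASHES = 4

def bloom_positions(signature):
    # Derive NUM_HASHES bit positions from a single hash of the signature.
    digest = hashlib.blake2b(signature, digest_size=NUM_HASHES * 8).digest()
    for i in range(NUM_HASHES):
        yield int.from_bytes(digest[i * 8:(i + 1) * 8], 'big') % BLOOM_BITS

def bloom_add(bits, signature):
    for p in bloom_positions(signature):
        bits[p >> 3] |= 1 << (p & 7)

def bloom_contains(bits, signature):
    return all(bits[p >> 3] & (1 << (p & 7)) for p in bloom_positions(signature))

seen_once = bytearray(BLOOM_BITS >> 3)
seen_twice = bytearray(BLOOM_BITS >> 3)   # signatures that (probably) occur at least twice

def first_pass(graphs, enumerate_signatures):
    # enumerate_signatures(graph) is assumed to yield a canonical byte string
    # for every candidate substructure of the graph; that part is specific to
    # your rules and is not shown here.
    for graph in graphs:
        for sig in enumerate_signatures(graph):
            if bloom_contains(seen_once, sig):
                bloom_add(seen_twice, sig)
            else:
                bloom_add(seen_once, sig)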
On your second pass, you will need to do the "heavier" process of actually confirming matches. In order to accomplish this:
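A sketch of how the second pass could use that filter; enumerate_substructures and canonical_signature are again placeholders, and is_candidate would typically be something like lambda sig: bloom_contains(seen_twice, sig) from the first-pass sketch above.

from collections import Counter

def second_pass(graphs, enumerate_substructures, canonical_signature, is_candidate):
    # enumerate_substructures(graph) yields concrete substructures,
    # canonical_signature(sub) maps one to the byte string used in the first pass,
    # and is_candidate(sig) is the cheap filter built in the first pass.
    counts = Counter()
    examples = {}                       # one concrete witness per signature
    for graph in graphs:
        for sub in enumerate_substructures(graph):
            sig = canonical_signature(sub)
            if not is_candidate(sig):
                continue                # pruned by the cheap first pass
            counts[sig] += 1
            examples.setdefault(sig, sub)
    # The counts are exact: Bloom false positives only cost extra work here,
    # they never produce wrong numbers.
    return counts, examples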
This algorithm will run very quickly on large datasets because it will significantly reduce the pressure on the appropriate cache level. There are also several enhancements that you can make in order to make it perform better in different circumstances.
A couple of notes for tweaking the above:
Anyway, to sum up with a statement on run-time complexity: it really depends. Many people ignore the fact that growing the working set makes memory accesses unequal in cost. Fundamentally, the above algorithm, if tuned appropriately, will run very fast on a small dataset, but it really shines on much larger datasets because it offers highly efficient ways of keeping the working set in whatever storage level is appropriate (L1, L2, L3, RAM, local disk, local network, etc.). The complexity of the algorithm depends on the data, but I do not believe an algorithm much faster can be created. I have left out how you represent the subgraphs, and there is work to be done there to arrive at the optimal algorithm, but without knowing more I can't determine the best way to store that information.
The reason an algorithm can't run much faster than the one I've presented is that the first pass requires far less work than the second: it doesn't branch, and bitwise operations are cheap, so it adds little to the total work. The second stage is also about as efficient as possible. Barring a way to perfectly describe each possibility with a finite set of bits (which I'll come back to in a moment), you must actually compare each graph feature and write the information somewhere. The only variable is how much work it takes to check whether you need to do that, and checking a bit whose error rate you can scale arbitrarily close to 0% is as good as it gets.
For smallish datasets, the reason two passes benefit you is that a Bloom filter can cover a much larger candidate cardinality in a much smaller amount of memory. For really small sets of data you might as well just run the second step and skip the first, but then, at a minimum, you'll need to store a pointer for each hash target, which means writing 32 or 64 times as much data for the same level of filtering. For small enough datasets this doesn't matter because a read is a read and a write is a write, but for larger datasets it lets you achieve the same level of filtering while staying within a given level of cache. In cases where you must work across multiple machines or threads, the mechanism in this algorithm is far more efficient, since the data can be combined much faster and much more information about potential matches can be exchanged.
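For a sense of scale, a quick back-of-the-envelope comparison (the numbers are purely illustrative):

# Per hash target, a Bloom filter flips one bit, whereas a pointer-based
# hash table stores a 32- or 64-bit pointer.
targets = 100_000_000                   # hypothetical number of hash targets
bloom_bytes = targets // 8              # 1 bit per target  -> ~12.5 MB
pointer_bytes = targets * 8             # 64-bit pointers   -> ~800 MB
print(pointer_bytes // bloom_bytes)     # -> 64, the factor mentioned above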
Now, lastly, as I alluded to, you may be able to do slightly better if the number of features checked on each iteration is reduced further. For example, if you are only checking for 32 possible labels and the number of children with a particular label in each pass (bounded by 1024), you can represent this perfectly with 15 bits. If you additionally limit each stored count to 255, you can keep an exact count table for those features in 32 KB. To pull this off in your case, you'd need the iteration, recursion, and sharding strategies mentioned above, and you'd also need to track the source graph and some other information. I honestly doubt this would work well except in very limited situations, but I'm including it for completeness.
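Concretely, under those bounds the packing could look like this (a sketch only; feature_key and bump are illustrative names, not anything from a library):

# 5 bits select one of 32 labels, 10 bits hold a child count up to 1023,
# giving a 15-bit key (2**15 = 32768 features). Capping each stored count
# at 255 lets the whole exact count table live in a 32 KB byte array.
counts = bytearray(1 << 15)

def feature_key(label_index, child_count):
    assert 0 <= label_index < 32 and 0 <= child_count < 1024
    return (label_index << 10) | child_count

def bump(label_index, child_count):
    key = feature_key(label_index, child_count)
    if counts[key] < 255:               # saturate rather than overflow
        counts[key] += 1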
Anyway, this is my first answer on Stack Overflow, so don't be too hard on me. I hope this was helpful!
The way I read your question, you may want something like the code below. It finds all matching subgraphs in a DAG in linear time. It doesn't support filters, but you can check the results after they are found and filter them manually (a rough sketch of such a post-filter follows the code). It may also find graphs with some parts collapsed: say you are looking for a tree a((b|c)|(c|d)); then it might find a subgraph where the c node is shared between the two subtrees. Again, you can inspect the output and filter out results like that. Doing such an inspection is of course only possible if the output size is not too large; for that you will have to do some experiments on your own graphs.
from collections import namedtuple, defaultdict
import random

Node = namedtuple('Node', ['label', 'children', 'id'])

# Simple tree pattern A(B|AB)
pattern = Node('A', (Node('B', (), 1),
                     Node('A', (Node('B', (), 3),), 2)), 0)

# Generate a random DAG
labels = 'ABCDE'
dag = []
for _ in range(1000):
    label = random.choice(labels)
    children = tuple(random.sample(dag, min(len(dag)//2, 10)))
    dag.append(Node(label, children, len(dag)))

# Helper
def subtrees(pattern):
    yield pattern
    for sub in pattern.children:
        yield from subtrees(sub)

colors = defaultdict(list)
# Iterate the nodes in topologically sorted order
for node in dag:
    # Check if the node is the head of some sub-pattern
    for sub in subtrees(pattern):
        if node.label == sub.label \
                and all(any(sc.id in colors[nc.id] for nc in node.children)
                        for sc in sub.children):
            # If so, color the node with that sub-pattern's color
            colors[node.id].append(sub.id)

matches = [node for node in dag if pattern.id in colors[node.id]]
print('Found {} matches.'.format(len(matches)))
I believe this is very similar to the approach Stefan Haustein had in mind.
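As for the collapsed matches mentioned above, a rough post-filter could look like the sketch below. It only checks the top level of the pattern (a full check would recurse), and the exhaustive assignment is fine for the small fan-outs used here.

from itertools import permutations

def distinct_children_match(node, pat, colors):
    # Try to assign every pattern child to a *different* DAG child.
    if len(node.children) < len(pat.children):
        return False
    return any(all(sc.id in colors[nc.id] for sc, nc in zip(pat.children, perm))
               for perm in permutations(node.children, len(pat.children)))

strict_matches = [n for n in matches if distinct_children_match(n, pattern, colors)]
print('Kept {} of {} matches after post-filtering.'.format(len(strict_matches), len(matches)))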