
Guided mining of common substructures in large set of graphs

I have a large (>1000) set of directed acyclic graphs, each with a large (>1000) set of vertices. The vertices are labeled, and the labels' cardinality is small (< 30).

I want to identify (mine) substructures that appear frequently over the whole set of graphs.

  1. A substructure is a graph of at least two directly connected vertices with specific labels. Such a substructure may appear once or more in one or more of the given input graphs. For example "a [vertex labeled A with two directly connected children labeled B] appears twice in graph U and once in graph V".
  2. A substructure we are looking for must obey a set of pre-given rules which filter on the vertices' labels. As an example: A substructure that contains a vertex labeled A is interesting if the sub-graph is "a vertex labeled A that has at least one directly connected child labeled B and is not a directly connected sibling of a vertex labeled U or V". Substructures that do not conform to these rules may appear in the input graphs but are not of interest for the search.

The output we are looking for is a list of substructures and their (number of) appearances in the given graphs.
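To make rule 2 above concrete, such a filter could be encoded as a predicate along these lines; the Node structure, the parents_of accessor and the rule_A name are placeholder names for illustration only, not an existing representation:

```python
from collections import namedtuple

# Placeholder structure for illustration only
Node = namedtuple('Node', ['label', 'children'])  # children: tuple of Node

def rule_A(vertex, parents_of):
    """A vertex labeled A is interesting if it has at least one child labeled B
    and no sibling (child of one of its parents) labeled U or V."""
    if vertex.label != 'A':
        return False
    if not any(child.label == 'B' for child in vertex.children):
        return False
    for parent in parents_of(vertex):
        if any(sib.label in ('U', 'V') for sib in parent.children if sib is not vertex):
            return False
    return True
```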

I have tried to look into things and (as it seems to always happen with me) the problem is NP-complete. As far as I can see, gSpan is the most common algorithm to solve this problem. However, as stated above, I'm not looking for arbitrary common substructures in the graphs but only those that obey certain rules. One should be able to use that to reduce the search space.

Any insight on how to approach this problem?

Update: I should probably add that the aforementioned rules can be recursive up to a certain degree. For example "a vertex labeled A with at least two children labeled B, each having at least one child labeled A". The maximum recursion depth is somewhere between 1 and 10.

Update II: To point out: we are not searching for known or preferred substructures but mining them. There is no needle given in advance.

asked Dec 15 '16 by user2722968



2 Answers

I'm assuming in my answer that you are trying to minimize the running time without spending an excessive amount of time writing the code. One thing I struggled with at first when learning to write highly efficient algorithms is that multiple passes can sometimes be far more efficient than one. In this case, I would say you fundamentally want two passes:

First, create a filter that allows you to ignore most (if not all) non-duplicated patterns. In order to do this:

  1. Allocate two bit arrays (and consider cache sizes when doing this). The first will be a simple bloom filter. The second will be a duplicate bloom filter.
  2. As you traverse the structure on a first pass, for each indexable structure, compute a hash value. Select the appropriate bit in your first bloom filter and set it. If that bit was already set, also set the corresponding bit in the duplicate bloom filter.
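A minimal sketch of this first pass, assuming the substructure hashes have already been computed (the enumeration and canonical hashing of substructures are problem-specific and not shown, and for simplicity each filter sets a single bit per hash rather than using the k hash functions of a full Bloom filter):

```python
FILTER_BITS = 512 * 1024  # filter size in bits; small enough to sit in L2 cache

def first_pass(substructure_hashes, filter_bits=FILTER_BITS):
    present = bytearray(filter_bits // 8)    # "seen at least once"
    duplicate = bytearray(filter_bits // 8)  # "probably seen at least twice"
    for h in substructure_hashes:
        bit = h % filter_bits
        byte, mask = bit // 8, 1 << (bit % 8)
        if present[byte] & mask:
            duplicate[byte] |= mask          # second or later sighting
        else:
            present[byte] |= mask            # first sighting
    return present, duplicate

def maybe_duplicate(h, duplicate, filter_bits=FILTER_BITS):
    bit = h % filter_bits
    return bool(duplicate[bit // 8] & (1 << (bit % 8)))
```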

On your second pass, you will need to do the "heavier" process of actually confirming matches. In order to accomplish this:

  1. Scan over the graph again and record any structures that match the duplicate bloom filter generated in the first pass.
  2. Place those that match in a hash table (ideally using different bits from the hash you computed).
  3. When a duplicate is detected, store that information wherever you would like to collect it.
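Continuing the sketch above, the second pass does the expensive exact grouping only for substructures flagged by the duplicate filter. Here hash_fn must be the same hash used in the first pass, and key_fn stands in for whatever exact canonical representation you choose; both are assumptions of this sketch:

```python
from collections import defaultdict

def second_pass(substructures, duplicate, hash_fn, key_fn, filter_bits=FILTER_BITS):
    """Group substructures whose hash hits the duplicate filter by their exact
    canonical key, then keep only keys that really occur more than once
    (Bloom filters can produce false positives, never false negatives)."""
    candidates = defaultdict(list)
    for sub in substructures:
        if maybe_duplicate(hash_fn(sub), duplicate, filter_bits):
            candidates[key_fn(sub)].append(sub)
    return {key: occurrences for key, occurrences in candidates.items()
            if len(occurrences) > 1}
```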

This algorithm will run very quickly on large datasets because it will significantly reduce the pressure on the appropriate cache level. There are also several enhancements that you can make in order to make it perform better in different circumstances.

  1. In order to improve performance on a multithreaded system, it is actually safe to parallelize the first step. To do this, give each thread (or computer in a cluster) a piece of the graph. Each should compute its own copy of the two Bloom filters. The filters may then be combined into a final pair; the reduction function is just (present, duplicate) = (present1 OR present2, duplicate1 OR duplicate2 OR (present1 AND present2)). This step is VERY fast (a sketch of this merge appears after this list).
  2. It is also completely safe to parallelize the second step, but it must be modified slightly. You take the duplicate Bloom filter from the first step and use it as a filter in the second step, as before, but you can't complete the final comparison as easily. You must instead place the potential duplicates in hash buckets. Then, after each shard of data has been written into its own table of potential duplicates, divide the data up by hash bucket and find the duplicates in a third step. Each hash bucket (from any output of the second step) must be processed by the same worker.
  3. In cases where you have a large number of structures that you are indexing, you may improve performance by recursively applying the above algorithm. The adjustment is that you use each matching category for the output from the above algorithm as your input into the recursive pass. For example, if you index only structures that have up to 5 items in the first run of the algorithm, you can, when you recurse, take each set of duplicated subgraphs and run the algorithm on only those sub-graphs. This would only be necessary with very large sets of data, obviously.
  4. Another adjustment you may consider if the graph is very large in order to increase the effectiveness of your bloom filters is to iterate on the algorithm. In the first pass, for example, you might only consider sub-graphs that have the first label as the base of the sub-graph. This would decrease the size required of your bloom filters and/or allow you to filter out more sub-graphs on the first pass.
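As a sketch of the reduction from point 1, merging the per-worker filter pairs produced by the first-pass sketch above:

```python
def merge_filters(pairs):
    """pairs: iterable of (present, duplicate) bytearray pairs, one per worker.
    Implements (present, duplicate) = (p1 OR p2, d1 OR d2 OR (p1 AND p2))."""
    pairs = iter(pairs)
    first_present, first_duplicate = next(pairs)
    present, duplicate = bytearray(first_present), bytearray(first_duplicate)
    for p, d in pairs:
        for i in range(len(present)):
            # Anything seen by two different workers is a duplicate even if
            # neither worker saw it twice on its own.
            duplicate[i] |= d[i] | (present[i] & p[i])
            present[i] |= p[i]
    return present, duplicate
```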

A couple notes for tweaking the above:

  1. Consider cache sizes. For example, on an Intel Haswell chip, each core has 32K of L1 cache and 256K of L2 cache. Each cache line holds 512 bits, so if you fill up even 1% of your Bloom filter, most of the cache lines will be touched. Depending on how fast the other parts of the algorithm are, and given that other things use these caches, you can safely create Bloom filters of up to around 512 * 1024 bits each (at 8 bits per byte that is 64K per filter, or 128K for the pair, which on a hyperthreaded system is roughly the share of the 256K L2 cache each thread gets) and still keep most of the data set in L2 cache and the really active stuff in L1. For smaller datasets, keep this number down because there is no point in making it large. If you only flag non-duplicate features as potential duplicates less than 1% of the time, that's totally fine.
  2. Parallelizing this is, again, only really useful in cases where you have tons of data; I'm assuming that you might. If you do parallelize, you should consider the geometry. Placing a partial set of the data on each computer will work with this algorithm. You can then run each iteration (of variation #4) on each computer. With huge datasets, that avoids having to transfer all the data to all the computers.

Anyway, to sum up with a statement on the run-time complexity, I will say that it really depends. Many people ignore the fact that increasing the working set of data causes memory accesses to stop being equal in cost. Fundamentally, the above algorithm, if tuned appropriately, will run very fast on a small data-set, but it really shines with much larger datasets because it offers highly efficient ways of keeping the working set of data in whatever cache level is appropriate (L1, L2, L3, RAM, local disk, local network, etc.). The complexity of the algorithm will depend on the data, but I do not believe an algorithm much faster can be created. I did leave out how you represent the subgraphs, and there is work to be done there to achieve the optimal algorithm, but without knowing more I can't determine the best way to store that information.

The reason that an algorithm can't run much faster than the one I've presented is that the first pass requires much less work than the second: it doesn't require branching, and bitwise operations are cheap. We can therefore say that it adds little to the overall work we're doing. The second stage is also about as efficient as possible. You must (barring a way to perfectly describe each possibility with a finite set of bits, which I'll explain in a second) actually compare each graph feature and write the information somewhere. The only variable is how much work it is to check whether you need to do this. Checking a bit whose error rate you can arbitrarily scale towards 0% is as good as you can get.

For smallish datasets, the reason that two passes benefit you is that you can fit a much larger Bloom filter cardinality in a smaller amount of memory. For really small sets of data, you might as well just use the second step and ignore the first. But, at a minimum, a one-pass approach needs to store a pointer for each hash target, which means writing 32 or 64 times as much data for the same level of filtering. For small enough datasets this doesn't matter, because a read is a read and a write is a write, but for larger datasets the Bloom filters let you accomplish the same level of filtering while staying in a given level of cache. In cases where you must work across multiple computers or threads, the mechanism provided in this algorithm will be WAY more efficient, as the data can be combined much faster and much more information about potential matches can be exchanged.

Now, lastly, as I alluded to, you may be able to get slightly better if the number of features that you check for on each iteration is reduced further. For example, if you are only checking for 32 possible labels and the number of children with a particular label in each pass (and this is bounded to 1024), you could represent this perfectly with 15 bits. If you limited the count to 255, you could store this information perfectly with 32K. In order to pull this off in your case, you'd need to use the iteration, recursion and sharding strategies that I mentioned above and you'd need to then also track the source graph, and some other information. I honestly doubt this would work well except in very limited situations, but I'm including it for completeness.
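As a small illustration of that 15-bit encoding (5 bits for one of 32 labels plus 10 bits for a child count capped at 1023); this is only a sketch of the packing, not a complete indexing scheme:

```python
def pack_feature(label_index, child_count):
    """Pack (label, count) into 15 bits: 5 bits for one of 32 labels,
    10 bits for a child count in the range 0..1023."""
    assert 0 <= label_index < 32 and 0 <= child_count < 1024
    return (label_index << 10) | child_count

def unpack_feature(packed):
    return packed >> 10, packed & 0x3FF

# Example: label index 3 with 7 matching children
assert unpack_feature(pack_feature(3, 7)) == (3, 7)
```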

Anyway, this is my first answer on Stack Overflow, so don't be too hard on me. I hope this was helpful!

answered Sep 24 '22 by Nick Apperson


The way I read your question, you may want something like the code below. It finds all matching subgraphs in a DAG in linear time. It doesn't support filters, but you can check the results after they are found and filter them manually. It also may find graphs with some parts collapsed: say you are looking for a tree a((b|c)|(c|d)); it might then find a subgraph where the c node is shared between the two subtrees. Again, you can inspect the output and filter out results like that. Such an inspection is of course only possible if the output size is not too large; for that you will have to experiment on your own graphs.

```python
from collections import namedtuple, defaultdict
import random

Node = namedtuple('Node', ['label', 'children', 'id'])

# Simple tree pattern: A(B | A(B))
pattern = Node('A', (Node('B', (), 1),
                     Node('A', (Node('B', (), 3),), 2)), 0)

# Generate a random DAG
labels = 'ABCDE'
dag = []
for _ in range(1000):
    label = random.choice(labels)
    children = tuple(random.sample(dag, min(len(dag)//2, 10)))
    dag.append(Node(label, children, len(dag)))

# Helper: yield the pattern and all of its sub-patterns
def subtrees(pattern):
    yield pattern
    for sub in pattern.children:
        yield from subtrees(sub)

colors = defaultdict(list)
# Iterate the nodes in topologically sorted order (children before parents)
for node in dag:
    # Check if the node is the head of some sub-pattern
    for sub in subtrees(pattern):
        if node.label == sub.label \
                and all(any(sc.id in colors[nc.id]
                            for nc in node.children) for sc in sub.children):
            # If so, color the node with that sub-pattern's color
            colors[node.id].append(sub.id)

matches = [node for node in dag if pattern.id in colors[node.id]]
print('Found {} matches.'.format(len(matches)))
```

I believe this is very similar to the approach Stefan Haustein had in mind.

answered Sep 24 '22 by Thomas Ahle