I have a working algorithm (and working code) for a custom reachable vertices algorithm with memoization on a 3 dimensional graph, but it is not performant enough. My goal is to find ways to make this procedure faster, if possible.
Consider the following problem. Let a graph G contain V nodes and E directed edges between the nodes, set up randomly. Let the graph be dense enough that E -> V^2, and let it be allowed to contain cycles. Let K be a subset of V; for each vertex in K, we wish to find its set of reachable vertices.
Let each vertex v_i in G have S states (numbered 1...S), each state having its own reachable-vertices cache. Given this, we can imagine the graph as a three-dimensional graph.
As is, this algorithm is implemented using an iterative DFS that interacts with a cache system. This cache system is simply a structure that holds a cache for each vertex state on the graph. The cache for a vertex state contains the reachable vertices for that vertex at that state.
My algorithm is critically slow for a non-negligible set of graph inputs because it spends ~80% of its time sorting, as verified through profiling. This is especially evident in graphs that have cycles and E -> V^2. The way I implemented it, each time I end the visit on a vertex during the search, I do the following:
procedure EndVisit(X):
    reachable_vertices = cache_get(X)->reachable_vertices_get           // N elements
    sortAndCompact(reachable_vertices)                                   // K elements, K <= N
    // parent is taken from the current DFS stack:
    parent_reachable_vertices = parent_cache_get(X)->reachable_vertices_get
    parent_reachable_vertices.append(X)
    parent_reachable_vertices.append(reachable_vertices)
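For concreteness, here is a minimal C++ sketch of what this step boils down to (the names StateCache and reachable, and the exact signatures, are illustrative, not my actual code):

    #include <algorithm>
    #include <vector>

    // Hypothetical cache entry for one (vertex, state) pair.
    struct StateCache {
        std::vector<int> reachable;   // reachable ids, possibly with duplicates
    };

    // Sort the list and drop duplicates in place: O(N log N) + O(N).
    void sortAndCompact(std::vector<int>& v) {
        std::sort(v.begin(), v.end());
        v.erase(std::unique(v.begin(), v.end()), v.end());
    }

    // Called when the DFS pops vertex state x whose parent is still on the stack.
    void EndVisit(StateCache& x_cache, StateCache& parent_cache, int x_id) {
        sortAndCompact(x_cache.reachable);                  // the call that dominates the profile
        parent_cache.reachable.push_back(x_id);             // the parent reaches x itself...
        parent_cache.reachable.insert(parent_cache.reachable.end(),
                                      x_cache.reachable.begin(),
                                      x_cache.reachable.end());  // ...and everything x reaches
    }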
Consider the following trivial graph with cycles.

We wish to find the reachable vertices of vertex B (in state 1) and vertex A (in state 1). This is what happens when running my modified iterative DFS procedure (note that the path the DFS takes, i.e., which children it visits first, is random):
call ReachableVerticesGet(vertex: B, state: 1):
- [b1]
- [b1,c1]
Call EndVisit ==> c1 {}, b1 {c1}
- [b1]
- [b1,a1]
- [b1,a1,b2]
- [b1,a1,b2,c1]
Call EndVisit ==> c1 {}, b2 {c1}
- [b1,a1,b2]
- [b1,a1,b2,a1] Cycle detected, we mark it
Call EndVisit ==> a1 {}, b2 {c1,a1}
- [b1,a1,b2]
Call EndVisit ==> b2 {c1,a1}, a1 {b2,c1,a1}
- [b1,a1]
- [b1,a1,d1]
- [b1,a1,d1,b2]
- [b1,a1,d1,b2,c1]
Call EndVisit ==> c1 {}, b2 {c1,a1,c1}
- [b1,a1,d1,b2]
- [b1,a1,d1,b2,a1] Cycle detected, we mark it
Call EndVisit ==> a1 {b2,c1,a1}, b2 {c1,a1,c1,a1,b2,c1,a1}
- [b1,a1,d1,b2]
Call EndVisit ==> b2 {b2,c1,a1}, d1 {b2,b2,c1,a1}
- [b1,a1,d1]
Call EndVisit ==> d1 {b2,c1,a1}, a1 {b2,c1,a1,d1,b2,c1,a1}
- [b1,a1]
Call EndVisit ==> a1 {b2,c1,a1,d1}, b1 {c1,a1,b2,c1,a1,d1}
- [b1]
Call EndVisit ==> b1 {c1,a1,b2,d1}
return {c1,a1,b2,d1}
call ReachableVerticesGet(vertex: A, state: 1):
A's cache contains {b2,c1,a1,d1}
return {b2,c1,a1,d1}
Now, this algorithm works. I have many tests, including weird graphs with cycles and randomly constructed graphs, and all of them pass. There are some degenerate cases where the algorithm runs for an abnormally long time despite the graphs being fairly small, but nothing critical; the algorithm runs correctly for the vast majority of cases. The problem is that this algorithm is wildly problematic in terms of performance.
Here is what I have tried so far:
- First, if I remove the SortAndCompact() call in each EndVisit(), the RAM spikes up to >32 GB really fast with no intention of stopping (no memory leak present), so the compaction cannot simply be dropped.
- I call std::sort every time I end the visit on a vertex, and in the context of graphs that have cycles and vertices that have multiple states, like the one in this example, the sort is sometimes called multiple times for the same vertex. Of course, I would love to not call std::sort since it is O(N log N), but the more I stare at graphs for the problem I am trying to solve, the more std::sort made sense at the time. There must exist better solutions, nevertheless.
- I tried a set/unordered_set instead of a vector for the reachable-vertices cache at each vertex. However, for my use cases, this does not give any reasonable improvement. I even tried using custom hash functions. I profiled, and the time simply migrates from std::sort to std::set::insert.
- I tried std::pmr (polymorphic memory resources) to build an unordered_map on the stack every time I would otherwise call std::sort. The map is used to compact the duplicates and replaces std::sort (a rough sketch is shown after this list). This gives ~25% improvement in many cases, but is still not satisfactory in my opinion.
- I tried deferring compaction, only calling SortAndCompact() on the caches once the system RAM reaches a certain threshold. This was a big hit on performance and I dropped the idea as well.
I tried many tricks overall involving std::sort, but I'm at a loss for ideas. I fear I may have to replace this reachability algorithm altogether, but I'm looking for a second opinion before I drop it.
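For reference, the pmr-based compaction mentioned above was roughly along these lines (a sketch only, here with a pmr unordered_set instead of a map for brevity; the buffer size and names are illustrative, not my exact code):

    #include <array>
    #include <cstddef>
    #include <memory_resource>
    #include <unordered_set>
    #include <vector>

    // Deduplicate without sorting: hash the ids into a set whose nodes are
    // allocated from a stack buffer, then rewrite the vector with the unique values.
    void compactWithPmr(std::vector<int>& v) {
        std::array<std::byte, 64 * 1024> buffer;   // stack storage; falls back to the heap if exhausted
        std::pmr::monotonic_buffer_resource pool(buffer.data(), buffer.size());
        std::pmr::unordered_set<int> seen(&pool);
        seen.reserve(v.size());

        std::size_t out = 0;
        for (int id : v)
            if (seen.insert(id).second)            // first time we see this id
                v[out++] = id;
        v.resize(out);                             // note: the result is not sorted
    }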
For the duplicate-removal part, quoting the question:
"First, if I remove the SortAndCompact() function called in each EndVisit(), the RAM spikes up to >32gb really fast with no intention of stopping"
if you are ok with losing a few GB of memory already, prepare a mask of 4 billion bits. Each bit tracks whether a given value is present in the list:
    for each element of the vector:        // O(N)
        mask[value_of_element] = 1         // at most 512 MB for the full 32-bit range (4 GB if you use one byte per value)
    compact(indices of bits that are set)  // O(N)
No sort is required. If the indices are not too chaotic, the CPU cache can improve performance further. It does require processing all targets in one pass, because compacting over 4 billion bit positions adds a high overhead. After you have used the bits, clear only the set bits (by walking the result vector) so the mask is ready for the next use.
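A minimal sketch of this, assuming your (vertex, state) pairs can be encoded as integer ids below some max_id (names are illustrative):

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // One reusable bit per possible id. Sized for the full 32-bit range this is
    // 2^32 bits = 512 MB; if your ids are denser, size it to max_id instead.
    struct PresenceMask {
        std::vector<bool> bit;
        explicit PresenceMask(std::size_t max_id) : bit(max_id, false) {}

        // Remove duplicates from v in O(N), no sorting.
        void compact(std::vector<std::uint32_t>& v) {
            std::size_t out = 0;
            for (std::uint32_t id : v) {
                if (!bit[id]) {            // first occurrence of this id
                    bit[id] = true;
                    v[out++] = id;
                }
            }
            // Clear only the bits we set, via the result, so the mask is ready
            // for the next call without touching the whole array.
            for (std::size_t i = 0; i < out; ++i)
                bit[v[i]] = false;
            v.resize(out);
        }
    };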
If I understand correctly, each node state has connections such that, on every move, it can only do two things: stay in the same adjacency matrix, or go to the same x,y element of another adjacency matrix.
Caching the individual "visited" status of matrix elements can be better than caching long lists of "reachables" if you are only computing one node's reachables, because caching a matrix element costs a constant number of bytes per access while caching reachables costs a variable amount.
Mark an element as visited with cache.set(i,j,k,true) and only expand it when !cache.get(i,j,k). For example:
starting node: b1
b1->b2 --> add b2 to list
b1->b2->c1 --> add c1 to list
b1->a1 --> add a1 to list
b1->a1->a2 --> add a2 to list
only visiting non-visited nodes, by caching the visited status rather than the reachables themselves. This should also be easier to parallelize, as locking matrix elements is easy and independent between multiple threads running different paths (though N threads means N times the memory requirement).
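A rough sketch of that visited-status traversal for a single query, assuming V vertices, S states and an adjacency list you already have (all names here are illustrative):

    #include <cstddef>
    #include <vector>

    struct NodeState { int v; int s; };

    // Reachables of one (vertex, state) source, using a flat visited bitmap over
    // all (vertex, state) pairs instead of per-node reachable lists.
    // adj[v * S + s] is assumed to hold the successor (vertex, state) pairs.
    std::vector<NodeState> reachableFrom(NodeState src, int V, int S,
                                         const std::vector<std::vector<NodeState>>& adj) {
        std::vector<bool> visited(static_cast<std::size_t>(V) * S, false);
        std::vector<NodeState> stack{src};
        std::vector<NodeState> out;
        visited[static_cast<std::size_t>(src.v) * S + src.s] = true;

        while (!stack.empty()) {
            NodeState cur = stack.back();
            stack.pop_back();
            for (NodeState nxt : adj[static_cast<std::size_t>(cur.v) * S + cur.s]) {
                std::size_t idx = static_cast<std::size_t>(nxt.v) * S + nxt.s;
                if (!visited[idx]) {       // each node state is expanded at most once
                    visited[idx] = true;
                    out.push_back(nxt);    // no duplicates, so no sort or compaction needed
                    stack.push_back(nxt);
                }
            }
        }
        return out;
    }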
Worst case, for 3 fully connected nodes with 3 states, the space requirement is already better than storing reachable lists; it should also be better for 4 nodes and more.
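To put rough numbers on it (my assumption: 32-bit ids): 3 nodes x 3 states = 9 node states, so the visited cache is 9 bits (or 9 bytes as plain bools), while the 9 per-node reachable lists can each grow to 9 entries, i.e. at least 9 x 9 x 4 = 324 bytes, and more than that before duplicates are compacted.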
With this caching, computing one node may benefit the computation of another node if their "visited" fields overlap within the cache when they are computed serially.