
Improving performance of a reachable-vertices algorithm applied on a 3-dimensional graph

Preamble and goals

I have working code for a custom reachable-vertices algorithm with memoization on a 3-dimensional graph, but it is not fast enough. My goal is to find ways to make this procedure faster, if possible.

Description

Consider the following problem. Let a graph G contain V vertices and E directed edges between them, set up randomly. Let the graph be dense enough that E approaches V^2, and let it be allowed to contain cycles. Let K be a subset of V; we wish to find the set of reachable vertices for each vertex in K.
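(For experimenting with such inputs, a dense random digraph can be generated as below; this is only a test-harness sketch, and the function name and edge probability p are illustrative, not part of the original code.)

#include <cstdint>
#include <random>
#include <vector>

// Dense random directed graph: each ordered pair (u, v), u != v, gets an
// edge with probability p, so E ~ p * V^2 and cycles appear naturally.
std::vector<std::vector<std::uint32_t>>
randomDenseDigraph(std::uint32_t V, double p, std::uint64_t seed) {
    std::mt19937_64 rng(seed);
    std::bernoulli_distribution edge(p);
    std::vector<std::vector<std::uint32_t>> adj(V);
    for (std::uint32_t u = 0; u < V; ++u)
        for (std::uint32_t v = 0; v < V; ++v)
            if (u != v && edge(rng)) adj[u].push_back(v);
    return adj;
}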

Let each vertex v in G have S states (indexed i = 1...S), each state having its own reachable-vertices cache. Given this, we can picture the graph as three-dimensional.

As is, this algorithm is implemented using an iterative DFS that interacts with a cache system. The cache system is simply a structure that holds one cache per vertex state; the cache for a vertex contains the reachable vertices for that vertex in that state.
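For concreteness, here is a minimal sketch of such a per-state cache layout, assuming vertices and states are small integers (all names here are hypothetical, not the poster's actual code):

#include <cstddef>
#include <cstdint>
#include <vector>

// A (vertex, state) pair flattened into one 32-bit key; the reachable
// lists store these keys as well.
using VertexState = std::uint32_t;

inline VertexState makeKey(std::uint32_t vertex, std::uint32_t state,
                           std::uint32_t num_states) {
    return vertex * num_states + state;
}

struct CacheSystem {
    // reachable[key] = unsorted, possibly duplicated list of reachable
    // (vertex, state) keys; compacted by sortAndCompact() in EndVisit().
    std::vector<std::vector<VertexState>> reachable;

    CacheSystem(std::size_t num_vertices, std::size_t num_states)
        : reachable(num_vertices * num_states) {}
};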

The problem

My algorithm is critically slow for a non-negligible set of graph inputs because, as verified through profiling, it spends ~80% of its time sorting. This is especially evident in graphs that have cycles and E close to V^2. The way I implemented it, each time I end the visit of a vertex during the search:

  1. I take the reachable vertices from the cache of the vertex and sort and compact them, so that vertices in different states and duplicated vertices are compacted together.
  2. I then append the currently visited vertex to the parent's reachable-vertices cache.
  3. I append the reachable vertices from the current vertex's cache to the cache of its parent.
procedure EndVisit(X):
  reachable_vertices = cache_get(X)->reachable_vertices_get // N elements
  sortAndCompact(reachable_vertices)                        // K elements | K <= N

  // parent from the current DFS stack:
  parent_reachable_vertices = parent_cache_get(X)->reachable_vertices_get 
  parent_reachable_vertices.append(X)
  parent_reachable_vertices.append(reachable_vertices)
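In C++, the sortAndCompact step is presumably the classic sort/unique/erase idiom; here is a sketch of EndVisit under that assumption, reusing the hypothetical CacheSystem above:

#include <algorithm>

// Deduplicate in place: O(N log N) sort followed by a linear unique pass.
void sortAndCompact(std::vector<VertexState>& v) {
    std::sort(v.begin(), v.end());
    v.erase(std::unique(v.begin(), v.end()), v.end());
}

void EndVisit(CacheSystem& cache, VertexState x, VertexState parent) {
    auto& reachable = cache.reachable[x];     // N elements
    sortAndCompact(reachable);                // K elements, K <= N

    auto& parent_reachable = cache.reachable[parent];
    parent_reachable.push_back(x);            // the vertex itself...
    parent_reachable.insert(parent_reachable.end(),  // ...then its reachables
                            reachable.begin(), reachable.end());
}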

Iterative example of the algorithm and considerations

Consider the following trivial graph with cycles.

[Image: example graph with cycles]

We wish to find the reachable vertices of vertex B (in state 1) and of vertex A (in state 1). This is what happens when running my modified iterative DFS procedure (note that the path the DFS takes, i.e., which children it visits first, is random):

call ReachableVerticesGet(vertex: B, state: 1):
  - [b1]
  - [b1,c1]
  Call EndVisit ==> c1 {}, b1 {c1}
  - [b1]
  - [b1,a1]
  - [b1,a1,b2]
  - [b1,a1,b2,c1]
  Call EndVisit ==> c1 {}, b2 {c1}
  - [b1,a1,b2]
  - [b1,a1,b2,a1] Cycle detected, we mark it
  Call EndVisit ==> a1 {}, b2 {c1,a1}
  - [b1,a1,b2]
  Call EndVisit ==> b2 {c1,a1}, a1 {b2,c1,a1}
  - [b1,a1]
  - [b1,a1,d1]
  - [b1,a1,d1,b2]
  - [b1,a1,d1,b2,c1]
  Call EndVisit ==> c1 {}, b2 {c1,a1,c1}
  - [b1,a1,d1,b2] 
  - [b1,a1,d1,b2,a1] Cycle detected, we mark it
  Call EndVisit ==> a1 {b2,c1,a1}, b2 {c1,a1,c1,a1,b2,c1,a1}
  - [b1,a1,d1,b2]
  Call EndVisit ==> b2 {b2,c1,a1}, d1 {b2,b2,c1,a1}
  - [b1,a1,d1]
  Call EndVisit ==> d1 {b2,c1,a1}, a1 {b2,c1,a1,d1,b2,c1,a1}
  - [b1,a1]
  Call EndVisit ==> a1 {b2,c1,a1,d1}, b1 {c1,a1,b2,c1,a1,d1}
  - [b1]
  Call EndVisit ==> b1 {c1,a1,b2,d1}
  return {c1,a1,b2,d1}
 
call ReachableVerticesGet(vertex: A, state: 1):
  A's cache contains {b2,c1,a1,d1}
  return {b2,c1,a1,d1}

Now, this algorithm works. I have many tests, including weird graphs with cycles and randomly constructed graphs, and all of them pass. There are some degenerate cases in which the algorithm runs for an abnormally long time despite the graph being small, but nothing critical; it runs correctly for the vast majority of cases. The problem is that this algorithm is wildly problematic in terms of performance.

  1. First, if I remove the SortAndCompact() call in each EndVisit(), RAM usage quickly spikes above 32 GB with no sign of stopping (there is no memory leak).
  2. Second, I call std::sort every time I end the visit of a vertex, and in graphs that have cycles and vertices with multiple states, like the one in this example, the sort can run multiple times for the same vertex. Of course, I would love to avoid std::sort since it is O(N log N), but the more I stared at graphs for the problem I am trying to solve, the more std::sort made sense at the time. Better solutions must exist, nevertheless.

What I have tried so far

  • The most obvious idea is to use a set/unordered_set instead of a vector for the reachable-vertices cache at each vertex. However, for my use cases this gives no reasonable improvement, even with custom hash functions. Profiling shows the time simply migrates from std::sort to std::set::insert.
  • I tried parallelizing the algorithm by cloning the caches, but merging the clones back was a nightmare, and I eventually dropped this line of investigation; there is too much shared state.
  • I tried using std::pmr (polymorphic memory resources) to build an unordered_map on a stack buffer every time I would otherwise call std::sort; the map compacts the duplicates and replaces the sort (see the sketch after this list). This gives a ~25% improvement in many cases, but is still not satisfactory in my opinion.
  • I tried caching only the source vertices from which the reachable-vertices search starts, instead of dynamically caching the reachable vertices of every intermediate vertex. This made my workflows slower, since in some cases I get 500k+ cache hits; it is important that I cache as much as possible.
  • I considered a polling system that accumulates reachable-vertices caches and invokes SortAndCompact() on them once system RAM reaches a certain threshold. This was a big hit on performance, so I dropped the idea as well.
  • Most of the input graphs in my application have strongly connected components. I tried preprocessing that sets vertex priorities for the DFS accordingly, but saw no quantifiable improvement.
  • I considered rewriting everything from scratch around a new algorithm, but that would take a lot of time, so I opted to look for ways to improve the current performance instead.
  • I have tried a wide range of other things; at this point I have been on this for many months.
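Regarding the std::pmr attempt above, here is a minimal sketch of what such a hash-based compaction can look like (the buffer size and names are illustrative; the pool silently falls back to the heap if the stack buffer overflows):

#include <array>
#include <cstddef>
#include <cstdint>
#include <memory_resource>
#include <unordered_set>
#include <vector>

// Hash-based duplicate removal: O(N) average instead of O(N log N),
// with the set's nodes allocated from a stack buffer.
void hashCompact(std::vector<std::uint32_t>& v) {
    std::array<std::byte, 64 * 1024> buffer;
    std::pmr::monotonic_buffer_resource pool(buffer.data(), buffer.size());
    std::pmr::unordered_set<std::uint32_t> seen(&pool);
    seen.reserve(v.size());

    std::size_t out = 0;
    for (std::uint32_t x : v)
        if (seen.insert(x).second)   // true only the first time x is seen
            v[out++] = x;            // keep the first occurrence, in place
    v.resize(out);
}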

I tried many tricks overall involving

  1. reducing the calls to std::sort
  2. eliminating the calls to std::sort
  3. making the calls to std::sort faster

but I am at a loss for ideas. I fear that I may have to replace this reachability algorithm altogether, but I am looking for a second opinion before I drop it entirely.

asked Nov 15 '25 by hexaquark
1 Answer

For the duplicate-removal part,

First, if I remove the SortAndCompact() call in each EndVisit(), RAM usage quickly spikes above 32 GB with no sign of stopping

if you are OK with spending a few GB of memory already, prepare a mask with one slot per possible 32-bit value. Each slot tracks whether that value is present in the list:

for each element of vector:               // O(N)
    mask[value_of_element] = 1            // at most 4 GB with one byte per 32-bit value
compact(values whose mask slot is set)    // O(N), driven by the input values

No sort required. If the values are not too chaotic, CPU caching can increase the performance. It is best to process all targets in one batch, because compacting by scanning all 4 billion slots would add high overhead.

After you are done with the mask, clear only the slots that were set, using the result vector, so it is ready for the next use.
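A minimal sketch of this scheme (here one byte per possible id rather than one bit, trading memory for speed; the table can be sized to vertices * states instead of the full 32-bit range):

#include <cstddef>
#include <cstdint>
#include <vector>

// Duplicate removal via a flag table: O(N), no sorting.
struct MaskCompactor {
    std::vector<std::uint8_t> seen;   // one flag per possible id

    explicit MaskCompactor(std::size_t id_space) : seen(id_space, 0) {}

    void compact(std::vector<std::uint32_t>& v) {
        std::size_t out = 0;
        for (std::uint32_t x : v)
            if (!seen[x]) { seen[x] = 1; v[out++] = x; }
        v.resize(out);
        // Clear only the flags we set, walking the compacted result,
        // so the table is reusable without an O(id_space) wipe.
        for (std::size_t i = 0; i < out; ++i)
            seen[v[i]] = 0;
    }
};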


If I understand correctly, each node state has connections to:

  • neighboring nodes in the same state
  • its own clones with different states (3rd dimension)

so on every move, a node can only do two things: stay within the same adjacency matrix, or jump to the same (x, y) element of another adjacency matrix.

Caching an individual "visited" status per matrix element can be better than caching long lists of "reachables" if you are only computing one node's reachables, because the visited cache costs a constant number of bytes per access, while a reachables cache costs a variable number of bytes.

  • select node
  • mark it "visited" through a cache (cache.set(i,j,k,true))
  • visit a connected node if it is not yet visited (!cache.get(i,j,k))
  • add it to the start node's list of reachables
  • repeat for all its connections until it doesn't have any unvisited connection (leaf nodes reached)
  • no sorting required unless you need the output sorted for something else

For example:

starting node: b1
b1->b2 --> add b2 to list
b1->b2->c1 --> add c1 to list
b1->a1 --> add a1 to list
b1->a1->a2 --> add a2 to list

visiting only non-visited nodes, caching the visited status rather than lists of reachables. This should also be easier to parallelize, since locking matrix elements can be easy and independent between multiple threads running different paths (though N threads means N times the memory requirement).
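Put together, a sketch of that loop, flattening (vertex, state) into one id as before (the adjacency layout is an assumption):

#include <cstdint>
#include <vector>

// Reachability by visited flags: each (vertex, state) id is marked at
// most once, so the output is duplicate-free by construction - no sort.
// adj[id] lists the successor ids (same-state neighbors and other-state
// clones alike).
std::vector<std::uint32_t>
reachableFrom(std::uint32_t start,
              const std::vector<std::vector<std::uint32_t>>& adj,
              std::vector<std::uint8_t>& visited)  // adj.size() zeroed flags
{
    std::vector<std::uint32_t> out;
    std::vector<std::uint32_t> stack{start};
    visited[start] = 1;
    while (!stack.empty()) {
        std::uint32_t u = stack.back();
        stack.pop_back();
        for (std::uint32_t w : adj[u])
            if (!visited[w]) {
                visited[w] = 1;
                out.push_back(w);      // recorded exactly once
                stack.push_back(w);
            }
    }
    visited[start] = 0;                // reset only the flags we touched,
    for (std::uint32_t w : out)        // so the buffer is reusable
        visited[w] = 0;
    return out;
}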

Worst case for 3 fully connected nodes with 3 states:

  • every node can reach 8 other nodes, so caching reachables takes 8 entries per node access; a simple 4-node path can already require 32 entries
  • the adjacency matrix has 9 elements, for a total of 27 elements across the 3 matrices

The space requirement is already better with 3 nodes, and should also be better for 4 or more.

With this caching scheme, computing one node may speed up computing another if their "visited" fields overlap within the CPU cache when they are computed serially.

answered Nov 17 '25 by huseyin tugrul buyukisik


