 

Graph Theory - How to find nodes reachable from a given node within certain cost?

I'm considering the following problem (very rough description):

Assume we have a graph where edges are assigned some non-negative costs, a starting node s and some cost constant C. Find out:

  • The set of nodes N reachable from s such that the cost of the shortest path from s to any node in N is not greater than C.
  • For each node in that set, the cost of the shortest path itself.

Basically Dijkstra with the cost constraint.
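
For illustration, here is a minimal sketch (Python) of the per-query computation I have in mind, assuming a plain adjacency-list representation; the names are just placeholders:

    import heapq

    def nodes_within_cost(graph, s, C):
        """Dijkstra that never expands paths whose cost exceeds C.

        graph: dict mapping node -> list of (neighbor, non-negative edge cost)
        Returns {node: shortest-path cost from s} for all nodes with cost <= C.
        """
        dist = {s: 0}
        heap = [(0, s)]
        while heap:
            d, u = heapq.heappop(heap)
            if d > dist.get(u, float("inf")):
                continue  # stale heap entry
            for v, w in graph.get(u, ()):
                nd = d + w
                if nd <= C and nd < dist.get(v, float("inf")):
                    dist[v] = nd
                    heapq.heappush(heap, (nd, v))
        return dist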

My primary question is: what is the correct terminology in the graph theory for this problem?

I've been looking at "accessibility" or "reachability" but these seem to be the wrong keywords.

Ultimately I'm looking for an algorithm which could efficiently answer many such queries in parallel on one (non-modifiable) but quite large (potentially ~100 million edges) graph. I'd like to check the literature but need hints on the right keywords.

Update: My practical problem is as follows.

Assume we're given a continental-size road network (modelled as a directed graph, with some properties on edges and nodes, such as whether an edge is a pedestrian way or a highway). Edge cost is travel time.

I'd like to answer user queries like: starting from some given position (graph node), which nodes are reachable within 1 hour?

I'd also like to answer these queries very fast (<200-300ms, if possible) for many users (>10000, if possible) in parallel.

I think there should be at least two optimizations possible:

  • Some reasonably-sized precomputation, for instance pre-computed distance matrices for selected "central" nodes.
  • If searches are conducted in parallel, they may profit from "temporary results" of each other.

I have a few ideas on my own but I'd first like to check the literature to avoid reinventing the wheel.

asked Jul 09 '15 by lexicore



1 Answer

The correct terminology for the problem you are dealing with is in the family of "shortest path algorithms". Reachability problems (e.g. Warshall's algorithm) deal with the question "is there a path between nodes A and B?", which has a binary answer; you don't need weights in that case, you only look at edges. In your case, however, you need to take into consideration the time it takes to travel from node A to node B.

Now, there is no "exact" fit for this problem, because a small modification of Dijkstra's, Floyd's, BFS or DFS can be used to solve it; the added complexity comes from the size of the graph, so it is important to understand how to build your solution. Which algorithm to use depends on your specific combination of time constraints and the hardware you have available.

There are 2 algorithmic solutions to your problem (I'm assuming you already have all the edges stored somewhere and that you can query this database somehow):

  1. In an ideal (imaginary) world you would run the Floyd-Warshall algorithm, save the resulting matrix in something like Redis, and that's it: you can serve your requests in less than 10 ms, and if the number of clients grows you can add more Redis servers as needed, since the graph is not likely to change very often (in your specific problem it is road information). The drawback is the space complexity of the solution; the best thing about it is that everything is precomputed, which means your response time to requests is minimal. To implement some variation of this you need a lot of space; even a Redis cluster with the DB stored on disk (yes, disk, not memory) and all the servers on SSDs should be enough. This option scales well as the number of concurrent clients grows, and the response time is quite fast. Whether or not you can use it depends on the number of nodes in your graph: even though the precomputation has to touch every edge, you only need space to store an NxN matrix, N being the number of nodes. If this matrix fits in Redis, you can precompute all the "distances" (in your case the sums of the edge costs, i.e. the travel times) between all pairs of nodes. Later, when you receive a request, you query the resulting matrix to fetch all the nodes with a travel time less than the given value; there is an additional optimization when storing this matrix in Redis that lets you fetch those node numbers quite fast, using sorted sets (see the first sketch after this list).

  2. The second solution is to modify Dijkstra's, BFS or DFS to prune the search based on the accumulated cost, and that's it. In this scenario you have already discarded Floyd-Warshall because of its high space complexity, which means that your graph is rather huge both in nodes and edges. The solution is almost the same with a variation of any of these algorithms; what makes the difference is how you store the graph. All three algorithms can be modified to respond within the time you want, but to support over 10k concurrent clients the database you use to store the graph makes the difference. Here you have another two options:

  • Use a standard SQL/NoSQL database: this database should be sharded somehow, since it has to serve several code servers (plural, because of the C10K problem) running the searches on the graph. I would suggest you continue your research in this area depending on the size of the graph data itself, but if you manage to put all the data in a Cassandra cluster or a similar technology, it can be tuned to the response times you want and it scales really well.
  • Make use of the fact that the graph is actually a "map" and find a way of running distance queries on the data: this potentially requires you to change the way you store the graph, adding something like latitude and longitude to the nodes. Instead of running the algorithm against the full graph, which would be absurd, you optimize the whole process by using the fact that a budget of a given number of minutes from a given node can be translated into a distance D in miles (approximately; add something like 10-20 miles to be on the safe side). You then run a query against the database to fetch all the nodes up to that distance D and run the algorithm of your choice against that small subgraph (see the second sketch after this list). It is important to note that the subgraph you pull from the database this way already gives you an approximation of the solution, provided the actual travel time on the edges is somewhat proportional to the distance travelled (this is not always true, which is why it is an approximation). At a small scale you could implement this with PostgreSQL + PostGIS, which lets you run such queries, but you are going to need to do some research here to find the solution that scales/works best for you.
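
A rough sketch of the precompute-and-query idea from option 1, assuming the NxN matrix fits and that a redis-py client is available; the travel:<node> key scheme and the toy Floyd-Warshall below are just illustrations, not a production layout:

    import redis  # assumption: redis-py is installed and a Redis server is running

    def all_pairs_travel_times(nodes, edges):
        """Plain Floyd-Warshall: O(N^3) time, O(N^2) space, so only viable if NxN fits."""
        INF = float("inf")
        dist = {u: {v: (0 if u == v else INF) for v in nodes} for u in nodes}
        for u, v, t in edges:                  # directed edges (u, v, travel_time)
            dist[u][v] = min(dist[u][v], t)
        for k in nodes:
            for i in nodes:
                dik = dist[i][k]
                for j in nodes:
                    if dik + dist[k][j] < dist[i][j]:
                        dist[i][j] = dik + dist[k][j]
        return dist

    def store_matrix(r, dist):
        """One sorted set per source node: member = destination id, score = travel time."""
        for u, row in dist.items():
            reachable = {str(v): t for v, t in row.items() if t != float("inf")}
            if reachable:
                r.zadd("travel:%s" % u, reachable)

    def reachable_within(r, s, max_minutes):
        """All destinations reachable from s within max_minutes, with their travel times."""
        return r.zrangebyscore("travel:%s" % s, 0, max_minutes, withscores=True)

    # Illustrative usage:
    # r = redis.Redis()
    # store_matrix(r, all_pairs_travel_times(nodes, edges))
    # print(reachable_within(r, start_node, 60))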
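And a sketch of the second option's two-step approach, using psycopg2 against a hypothetical PostGIS schema with nodes(id, geom geography) and edges(src, dst, travel_time_min); the table names, column names and the 70 mph speed assumption are made up for illustration, and the final search is the same cost-bounded Dijkstra idea mentioned in the question:

    import heapq
    import psycopg2  # assumption: the graph lives in PostgreSQL + PostGIS

    def fetch_subgraph(conn, start_node, minutes, avg_speed_mph=70, slack_miles=20):
        """Pull only the edges whose endpoints lie within a distance bound of the start node."""
        radius_m = (minutes / 60.0 * avg_speed_mph + slack_miles) * 1609.34
        graph = {}
        with conn.cursor() as cur:
            cur.execute(
                """
                SELECT e.src, e.dst, e.travel_time_min
                FROM nodes s
                JOIN nodes a ON ST_DWithin(a.geom, s.geom, %s)
                JOIN edges e ON e.src = a.id
                JOIN nodes b ON b.id = e.dst AND ST_DWithin(b.geom, s.geom, %s)
                WHERE s.id = %s
                """,
                (radius_m, radius_m, start_node),
            )
            for src, dst, t in cur:
                graph.setdefault(src, []).append((dst, t))
        return graph

    def reachable_within_subgraph(conn, start_node, minutes):
        """Cost-bounded Dijkstra on the (much smaller) subgraph pulled from the database."""
        graph = fetch_subgraph(conn, start_node, minutes)
        dist = {start_node: 0.0}
        heap = [(0.0, start_node)]
        while heap:
            d, u = heapq.heappop(heap)
            if d > dist.get(u, float("inf")):
                continue  # stale heap entry
            for v, w in graph.get(u, ()):
                nd = d + w
                if nd <= minutes and nd < dist.get(v, float("inf")):
                    dist[v] = nd
                    heapq.heappush(heap, (nd, v))
        return dist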

Hope it helps!

answered Sep 19 '22 by Sergio Ayestarán