I have written some code that finds all the paths upstream of a given reach in a dendritic stream network. As an example, if I represent the following network:
      4 -- 5 -- 8
     /
    2 --- 6 - 9 -- 10
   /           \
  1             11
   \
    3 ---- 7
as a set of (reach, downstream parent) pairs:
{(11, 9), (10, 9), (9, 6), (6, 2), (8, 5), (5, 4), (4, 2), (2, 1), (3, 1), (7, 3)}
it will return all of the paths upstream of a node, for instance:
get_paths(h, 1)  # edited: previously had 11 instead of 1
[[Reach(2), Reach(6), Reach(9), Reach(11)], [Reach(2), Reach(6), Reach(9), Reach(10)], [Reach(2), Reach(4), Reach(5), Reach(8)], [Reach(3), Reach(7)]]
The code is included below.
My question is: I am applying this to every reach in a very large region (e.g., New England), where any given reach may have millions of upstream paths. There's probably no way to avoid this being a very long operation, but is there a Pythonic way to perform it so that brand-new paths aren't regenerated on each run?
For example, if I run get_paths(h, 2) and all paths upstream from 2 are found, can I later run get_paths(h, 1) without retracing all of the paths in 2?
import collections

# Object representing a stream reach. Used to construct a hierarchy for the accumulation function.
class Reach(object):
    def __init__(self):
        self.name = None
        self.ds = None   # downstream reach
        self.us = set()  # upstream reaches

    def __repr__(self):
        return "Reach({})".format(self.name)

def build_hierarchy(flows):
    hierarchy = collections.defaultdict(lambda: Reach())
    for reach_id, parent in flows:
        if reach_id:  # skip falsy (null/zero) reach IDs
            hierarchy[reach_id].name = reach_id
            hierarchy[parent].name = parent
            hierarchy[reach_id].ds = hierarchy[parent]
            hierarchy[parent].us.add(hierarchy[reach_id])
    return hierarchy

def get_paths(h, start_node):
    def go_up(n):
        if not h[n].us:  # headwater reach: the current path is complete
            paths.append(current_path[:])
        for us in h[n].us:
            current_path.append(us)
            go_up(us.name)
        if current_path:
            current_path.pop()

    paths = []
    current_path = []
    go_up(start_node)
    return paths
test_tree = {(11, 9), (10, 9), (9, 6), (6, 2), (8, 5), (5, 4), (4, 2), (2, 1), (3, 1), (7, 3)}
h = build_hierarchy(test_tree)
p = get_paths(h, 1)
EDIT: A few weeks ago I asked a similar question about finding "ALL" upstream reaches in a network and received an excellent answer that was very fast:
class Node(object):
    def __init__(self):
        self.name = None
        self.parent = None
        self.children = set()
        self._upstream = set()

    def __repr__(self):
        return "Node({})".format(self.name)

    @property
    def upstream(self):
        if self._upstream:
            return self._upstream
        else:
            for child in self.children:
                self._upstream.add(child)
                self._upstream |= child.upstream
            return self._upstream

import collections

edges = {(11, 9), (10, 9), (9, 6), (6, 2), (8, 5), (5, 4), (4, 2), (2, 1), (3, 1), (7, 3)}
nodes = collections.defaultdict(lambda: Node())
for node, parent in edges:
    nodes[node].name = node
    nodes[parent].name = parent
    nodes[node].parent = nodes[parent]
    nodes[parent].children.add(nodes[node])
I noticed that the upstream property of this code visits upstream nodes in sequential order as it recurses, but because it accumulates them into a set I can't find a good way to append them to a single ordered list. Perhaps there is a way to modify this code so that it preserves the order.
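To make the idea concrete, here is a minimal sketch of the kind of modification I have in mind (only tried on the toy network above): swap the cached set for a cached list, so that each child is followed immediately by everything upstream of it:

class Node(object):
    def __init__(self):
        self.name = None
        self.parent = None
        self.children = set()
        self._upstream = []  # ordered cache instead of a set

    def __repr__(self):
        return "Node({})".format(self.name)

    @property
    def upstream(self):
        # Memoized depth-first ordering: each child appears immediately
        # before everything upstream of it.
        if not self._upstream:
            for child in self.children:
                self._upstream.append(child)
                self._upstream.extend(child.upstream)
        return self._upstream

The order among siblings is still arbitrary because children is a set, but within each branch the upstream sequence is preserved.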
Yes, you can do this. I'm not fully sure what your constraints are, but this should get you on the right track. The worst-case run time is still O(|E| + |V|); the only difference is that p.dfsh caches previously computed traversals, while p.dfs does not. The caching adds space overhead, so be aware of that tradeoff: you'll save many iterations (depending on your data set) at the expense of more memory, no matter what. Unfortunately, the caching doesn't improve the order of growth, only the practical run time:
points = set([
    (11, 9),
    (10, 9),
    (9, 6),
    (6, 2),
    (8, 5),
    (5, 4),
    (4, 2),
    (2, 1),
    (3, 1),
    (7, 3),
])

class PathFinder(object):
    def __init__(self, points):
        self.graph = self._make_graph(points)
        self.hierarchy = {}

    def _make_graph(self, points):
        # Map each downstream reach to its upstream neighbors; in this
        # data set the downstream (parent) ID is always the smaller of the pair.
        graph = {}
        for p in points:
            less, more = min(p), max(p)
            if less not in graph:
                graph[less] = set([more])
            else:
                graph[less].add(more)
        return graph

    def dfs(self, start):
        visited = set()
        stack = [start]
        _count = 0
        while stack:
            _count += 1
            vertex = stack.pop()
            if vertex not in visited:
                visited.add(vertex)
                if vertex in self.graph:
                    stack.extend(v for v in self.graph[vertex])
        print "Start: {s} | Count: {c} |".format(c=_count, s=start),
        return visited

    def dfsh(self, start):
        visited = set()
        stack = [start]
        _count = 0
        while stack:
            _count += 1
            vertex = stack.pop()
            if vertex not in visited:
                # Reuse a previously cached traversal when available.
                if vertex in self.hierarchy:
                    visited.update(self.hierarchy[vertex])
                else:
                    visited.add(vertex)
                    if vertex in self.graph:
                        stack.extend([v for v in self.graph[vertex]])
        self.hierarchy[start] = visited
        print "Start: {s} | Count: {c} |".format(c=_count, s=start),
        return visited

p = PathFinder(points)
print p.dfsh(1)
print p.dfsh(2)
print p.dfsh(9)
print p.dfsh(6)
print p.dfsh(2)
print
print p.dfs(1)
print p.dfs(2)
print p.dfs(9)
print p.dfs(6)
print p.dfs(2)
The output for p.dfsh is the following:
Start: 1 | Count: 11 | set([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
Start: 2 | Count: 8 | set([2, 4, 5, 6, 8, 9, 10, 11])
Start: 9 | Count: 3 | set([9, 10, 11])
Start: 6 | Count: 2 | set([9, 10, 11, 6])
Start: 2 | Count: 1 | set([2, 4, 5, 6, 8, 9, 10, 11])
The output for just the regular p.dfs is:
Start: 1 | Count: 11 | set([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
Start: 2 | Count: 8 | set([2, 4, 5, 6, 8, 9, 10, 11])
Start: 9 | Count: 3 | set([9, 10, 11])
Start: 6 | Count: 4 | set([9, 10, 11, 6])
Start: 2 | Count: 8 | set([2, 4, 5, 6, 8, 9, 10, 11])
As you can see, I do a DFS but keep track of previous traversals, within reason: I don't want to cache every possible previous path, because on a large data set that would take up a ridiculous amount of memory. In the output, you can see the iteration count for p.dfsh(2) drop from 8 to 1; likewise, the count for p.dfsh(6) drops to 2 because of the previous computation of p.dfsh(9). This is a modest run-time improvement over the standard DFS, especially on significantly large data sets.
Sure, assuming you have enough memory to store all the paths from each node, you can just use a straightforward modification of the code you've received in that answer:
class Reach(object):
    def __init__(self):
        self.name = None
        self.ds = None
        self.us = set()
        self._paths = []  # memoized list of paths upstream of this reach

    def __repr__(self):
        return "Reach({})".format(self.name)

    @property
    def paths(self):
        # Computed once per reach; later calls return the cached list.
        if not self._paths:
            for child in self.us:
                if child.paths:
                    self._paths.extend([child] + path for path in child.paths)
                else:
                    self._paths.append([child])
        return self._paths
Mind you, for some 20,000 reaches the required memory for that approach will be on the order of gigabytes. Assuming a generally balanced tree of reaches, the required memory is O(n^2), where n is the total number of reaches; at roughly n^2 = 4 x 10^8 stored references of 8-16 bytes each, that works out to about 4-8 GiB for 20,000 reaches, depending on your system. Required time is O(1) for any node, though, once the paths from h[1] have been computed.
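To illustrate the reuse, here is a quick usage sketch, assuming the build_hierarchy function and test_tree from the question with this modified Reach class substituted in:

h = build_hierarchy(test_tree)

p2 = h[2].paths  # traverses everything above reach 2, caching a path list on each reach
p1 = h[1].paths  # reuses the lists already cached under 2 instead of retracing that subtree

As with your get_paths, the returned paths start at the children of the queried reach rather than at the reach itself.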