I have a large set of From/To pairs that represent a hierarchy of connected nodes. As an example, the hierarchy:
4 -- 5 -- 8
/
2 --- 6 - 9 -- 10
/ \
1 -- 11
\
3 ----7
is encapsulated as:
{(11, 9), (10, 9), (9, 6), (6, 2), (8, 5), (5, 4), (4, 2), (2, 1), (3, 1), (7, 3)}
I'd like to be able to create a function that returns all nodes upstream of a given node, e.g.:
nodes[2].us
> [4, 5, 6, 8, 9, 10, 11]
My actual set of nodes is in the tens of thousands, so I'd like to be able to very quickly return a list of all upstream nodes without having to perform recursion over the entire set each time I want to get an upstream set.
This is my best attempt so far, but it doesn't get beyond two levels up.
class Node:
def __init__(self, fr, to):
self.fr = fr
self.to = to
self.us = set()
def build_hierarchy(nodes):
for node in nodes.values():
if node.to in nodes:
nodes[node.to].us.add(node)
for node in nodes.values():
for us_node in node.us.copy():
node.us |= us_node.us
return nodes
from_to = {(11, 9), (10, 9), (9, 6), (6, 2), (8, 5), (5, 4), (4, 2), (2, 1), (3, 1), (7, 3), (1, 0)}
nodes = {fr: Node(fr, to) for fr, to in from_to} # node objects indexed by "from"
nodes = build_hierarchy(nodes)
print [node.fr for node in nodes[2].us]
> [4, 6, 5, 9]
I'll show two ways of doing this. First, we'll simply modify your us attribute to intelligently compute and cache the results of a descendant lookup. Second, we'll use a graph library, networkx.
I'd really recommend you go with the graph library if your data naturally has graph structure. You'll save yourself a lot of hassle that way.
us nodes propertyYou can make your us attribute a property, and cache the results of previous lookups:
class Node(object):
def __init__(self):
self.name = None
self.parent = None
self.children = set()
self._upstream = set()
def __repr__(self):
return "Node({})".format(self.name)
@property
def upstream(self):
if self._upstream:
return self._upstream
else:
for child in self.children:
self._upstream.add(child)
self._upstream |= child.upstream
return self._upstream
Note that I'm using a slightly different representation than you. I'll create the graph:
import collections
edges = {(11, 9), (10, 9), (9, 6), (6, 2), (8, 5), (5, 4), (4, 2), (2, 1), (3, 1), (7, 3)}
nodes = collections.defaultdict(lambda: Node())
for node, parent in edges:
nodes[node].name = node
nodes[parent].name = parent
nodes[node].parent = nodes[parent]
nodes[parent].children.add(nodes[node])
and I'll lookup the upstream nodes for node 2:
>>> nodes[2].upstream
{Node(5), Node(4), Node(11), Node(9), Node(6), Node(8), Node(10)}
Once the nodes upstream of 2 are computed, they won't be recomputed if you call, for example nodes[1].upstream. If you make any changes to your graph, then, the upstream nodes will be incorrect.
networkxIf we use networkx to represent our graph, a lookup of all of the descendants of a node is very simple:
>>> import networkx as nx
>>> from_to = [(11, 9), (10, 9), (9, 6), (6, 2), (8, 5), (5, 4), (4, 2),
(2, 1), (3, 1), (7, 3), (1, 0)]
>>> graph = nx.DiGraph(from_to).reverse()
>>> nx.descendants(graph, 2)
{4, 5, 6, 8, 9, 10, 11}
This doesn't fully answer your question, which seemed to be about optimizing the lookup of descendants so work wasn't repeated on subsequent calls. However, for all we know, networkx.descendants might do some intelligent caching.
So this is what I'd suggest: avoid optimizing prematurely and use the libraries. If networkx.descendants is too slow, then you might investigate the networkx code to see if it caches lookups. If not, you can build your own caching lookup using more primitive networkx functions. My bet is that networkx.descendants will work just fine, and you won't need to go through the extra work.
Here's a function that will calculate the entire upstream list for a single node:
def upstream_nodes(start_node):
result = []
current = start_node
while current.to: # current.to == 0 means we're at the root node
result.append(current.to)
current = nodes[current.to]
return result
You've said that you don't want to iterate over the entire set of nodes each time you query an upstream, but this won't: it will just query that nodes' parent, and its parent, all the way to the root. So if the node is four levels down, it will make four dictionary lookups.
Or, if you want to be really clever, here's a version that will only make each parent lookup once, then store that lookup in the Node object's .us attribute so you never have to calculate the value again. (If your nodes' parent links aren't going to change after the graph has been created, this will work -- if you change your graph, of course, it won't).
def caching_upstream_nodes(start_node, nodes):
# start_node is the Node object whose upstream set you want
# nodes is the dictionary you created mapping ints to Node objects
if start_node.us:
# We already calculated this once, no need to re-calculate
return start_node.us
parent = nodes.get(start_node.to)
if parent is None:
# We're at the root node
start_node.us = set()
return start_node.us
# Otherwise, our upstream is our parent's upstream, plus the parent
parent_upstream = caching_upstream_nodes(parent, nodes)
start_node.us = parent_upstream.copy()
start_node.us.add(start_node.to)
return start_node.us
One of those two functions should be what you're looking for. (NOTE: Exercise a little bit of caution when running these, as I just wrote them but haven't invested the time to test them. I believe the algorithm is correct, but there's always a chance that I made a basic error in writing it.)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With