Chaining data without iteration

Question

I have a large set of From/To pairs that represent a hierarchy of connected nodes. As an example, the hierarchy:

     4 -- 5 -- 8
    / 
   2 --- 6 - 9 -- 10
  /           \ 
 1              -- 11
  \
   3 ----7

is encapsulated as:

{(11, 9), (10, 9), (9, 6), (6, 2), (8, 5), (5, 4), (4, 2), (2, 1), (3, 1), (7, 3)}

I'd like to be able to create a function that returns all nodes upstream of a given node, e.g.:

nodes[2].us
> [4, 5, 6, 8, 9, 10, 11]

My actual set of nodes is in the tens of thousands, so I'd like to be able to very quickly return a list of all upstream nodes without having to perform recursion over the entire set each time I want to get an upstream set.

This is my best attempt so far, but it doesn't get beyond two levels up.

class Node:
    def __init__(self, fr, to):
        self.fr = fr
        self.to = to
        self.us = set()

def build_hierarchy(nodes):
    for node in nodes.values():
        if node.to in nodes:
            nodes[node.to].us.add(node)
    for node in nodes.values():
        for us_node in node.us.copy():
            node.us |= us_node.us
    return nodes

from_to = {(11, 9), (10, 9), (9, 6), (6, 2), (8, 5), (5, 4), (4, 2), (2, 1), (3, 1), (7, 3), (1, 0)}
nodes = {fr: Node(fr, to) for fr, to in from_to} # node objects indexed by "from"
nodes = build_hierarchy(nodes)

print [node.fr for node in nodes[2].us]
> [4, 6, 5, 9]

jme · Accepted Answer

I'll show two ways of doing this. First, we'll simply modify your us attribute to intelligently compute and cache the results of a descendant lookup. Second, we'll use a graph library, networkx.

I'd really recommend you go with the graph library if your data naturally has graph structure. You'll save yourself a lot of hassle that way.

Caching `us` nodes property

You can make your us attribute a property, and cache the results of previous lookups:

class Node(object):

    def __init__(self):
        self.name = None
        self.parent = None
        self.children = set()
        self._upstream = set()

    def __repr__(self):
        return "Node({})".format(self.name)

    @property
    def upstream(self):
        if self._upstream:
            return self._upstream
        else:
            for child in self.children:
                self._upstream.add(child)
                self._upstream |= child.upstream
            return self._upstream

Note that I'm using a slightly different representation than you. I'll create the graph:

import collections

edges = {(11, 9), (10, 9), (9, 6), (6, 2), (8, 5), (5, 4), (4, 2), (2, 1), (3, 1), (7, 3)}
nodes = collections.defaultdict(lambda: Node())

for node, parent in edges:
    nodes[node].name = node
    nodes[parent].name = parent
    nodes[node].parent = nodes[parent]
    nodes[parent].children.add(nodes[node])

and I'll lookup the upstream nodes for node 2:

>>> nodes[2].upstream
{Node(5), Node(4), Node(11), Node(9), Node(6), Node(8), Node(10)}

Once the nodes upstream of 2 are computed, they won't be recomputed if you call, for example nodes[1].upstream. If you make any changes to your graph, then, the upstream nodes will be incorrect.

Using `networkx`

If we use networkx to represent our graph, a lookup of all of the descendants of a node is very simple:

>>> import networkx as nx
>>> from_to = [(11, 9), (10, 9), (9, 6), (6, 2), (8, 5), (5, 4), (4, 2), 
               (2, 1), (3, 1), (7, 3), (1, 0)]
>>> graph = nx.DiGraph(from_to).reverse()
>>> nx.descendants(graph, 2)
{4, 5, 6, 8, 9, 10, 11}

This doesn't fully answer your question, which seemed to be about optimizing the lookup of descendants so work wasn't repeated on subsequent calls. However, for all we know, networkx.descendants might do some intelligent caching.

So this is what I'd suggest: avoid optimizing prematurely and use the libraries. If networkx.descendants is too slow, then you might investigate the networkx code to see if it caches lookups. If not, you can build your own caching lookup using more primitive networkx functions. My bet is that networkx.descendants will work just fine, and you won't need to go through the extra work.

rmunn · Answer

Here's a function that will calculate the entire upstream list for a single node:

def upstream_nodes(start_node):
    result = []
    current = start_node
    while current.to:  # current.to == 0 means we're at the root node
        result.append(current.to)
        current = nodes[current.to]
    return result

You've said that you don't want to iterate over the entire set of nodes each time you query an upstream, but this won't: it will just query that nodes' parent, and its parent, all the way to the root. So if the node is four levels down, it will make four dictionary lookups.

Or, if you want to be really clever, here's a version that will only make each parent lookup once, then store that lookup in the Node object's .us attribute so you never have to calculate the value again. (If your nodes' parent links aren't going to change after the graph has been created, this will work -- if you change your graph, of course, it won't).

def caching_upstream_nodes(start_node, nodes):
    # start_node is the Node object whose upstream set you want
    # nodes is the dictionary you created mapping ints to Node objects
    if start_node.us:
        # We already calculated this once, no need to re-calculate
        return start_node.us
    parent = nodes.get(start_node.to)
    if parent is None:
        # We're at the root node
        start_node.us = set()
        return start_node.us
    # Otherwise, our upstream is our parent's upstream, plus the parent
    parent_upstream = caching_upstream_nodes(parent, nodes)
    start_node.us = parent_upstream.copy()
    start_node.us.add(start_node.to)
    return start_node.us

One of those two functions should be what you're looking for. (NOTE: Exercise a little bit of caution when running these, as I just wrote them but haven't invested the time to test them. I believe the algorithm is correct, but there's always a chance that I made a basic error in writing it.)

Chaining data without iteration

Tags:

python

recursion

triphook

2 Answers

Caching `us` nodes property

Using `networkx`

jme

rmunn

Recent Activity

Donate For Us

Chaining data without iteration

Tags:

python

recursion

triphook

2 Answers

Caching us nodes property

Using networkx

jme

rmunn

Related questions

Recent Activity

Donate For Us

Caching `us` nodes property

Using `networkx`