Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Improving scalability of the modified preorder tree traversal algorithm

I've been thinking about the modified preorder tree traversal algorithm for storing trees within a flat table (such as SQL).

One property I dislike about the standard approach is that to insert a node you have to touch (on average) N/2 of the nodes (everything with left or right higher than the insert point).

The implementations I've seen rely on sequentially numbered values. This leaves no room for updates.

This seems bad for concurrency and scaling. Imagine you have a tree rooted at the world containing user groups for every account in a large system, it's extremely large, to the point you must store subsets of the tree on different servers. Touching half of all the nodes to add a node to the bottom of the tree is bad.

Here is the idea I was considering. Basically leave room for inserts by partitioning the keyspace and dividing at each level.

Here's an example with Nmax = 64 (this would normally be the MAX_INT of your DB)

                     0:64
              ________|________
             /                 \
         1:31                   32:63
        /    \                 /     \
    2:14    15-30         33:47       48:62

Here, a node is added to the left half of the tree.

                     0:64  
              ________|________
             /                 \
         1:31                  32:63
      /   |   \               /     \
  2:11  11:20  21:30     33:47       48:62

The alogorithm must be extended for the insert and removal process to recursively renumber to the left/right indexes for the subtree. Since querying for immediate children of a node is complicated, I think it makes sense to also store the parent id in the table. The algorithm can then select the sub tree (using left > p.left && right < p.right), then use node.id and node.parent to work through the list, subdividing the indexes.

This is more complex than just incrementing all the indexes to make room for the insert (or decrementing for removal), but it has the potential to affect far fewer nodes (only decendenants of the parent of the inserted/removed node).

My question(s) are basically:

  1. Has this idea been formalized or implemented?

  2. Is this the same as nested intervals?

like image 858
Mark Renouf Avatar asked Jun 26 '09 15:06

Mark Renouf


2 Answers

I have heard of people doing this before, for the same reasons, yes.

Note that you do lose at a couple of small advantages of the algorithm by doing this

  • normally, you can tell the number of descendants of a node by ((right - left + 1) div 2). This can occasionally be useful, if e.g. you'd displaying a count in a treeview which should include the number of children to be found further down in the tree
  • Flowing from the above, it's easy to select out all leaf nodes -- WHERE (right = left + 1).

These are fairly minor advantages and may not be useful to you anyway, though for some usage patterns they're obviously handy.

That said, it does sound like materialized paths may be more useful to you, as suggested above.

like image 161
Cowan Avatar answered Nov 14 '22 00:11

Cowan


I think you're better off looking at a different way of storing trees. If your tree is broad but not terribly deep (which seems likely for the case you suggested), you can store the complete list of ancestors up to the root against each node. That way, modifying a node doesn't require touching any nodes other than the node being modified.

like image 40
Nick Johnson Avatar answered Nov 13 '22 23:11

Nick Johnson