Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Efficient mass modification of persistent data structures

I understand how typically trees are used to modify persistent data structures (create a new node and replace all it's ancestors).

But what if I have a tree of 10,000's of nodes and I need to modify 1000's of them? I don't want to go through and create 1000's of new roots, I only need the one new root that results from modifying everything at once.

For example: Let's take a persistent binary tree for example. In the single update node case, it does a search until it finds the node, creates a new one with the modifications and the old children, and creates new ancestors up to the root.

In the bulk update case could we do: Instead of just updating a single node, you're going to update 1000 nodes on it in one pass.

At the root node, the current list is the full list. You then split that list between those that match the left node and those that match the right. If none match one of the children, don't descend to it. You then descend to the left node (assuming there were matches), split its search list between its children, and continue. When you have a single node and a match, you update it and go back up, replacing and updating ancestors and other branches as appropriate.

This would result in only one new root even though it modified any number of nodes.

like image 806
mentics Avatar asked May 26 '11 17:05

mentics


2 Answers

These kind of "mass modification" operations are sometimes called bulk updates. Of course, the details will vary depending on exactly what kind of data structure you are working with and what kind of modifications you are trying to perform.

Typical kinds of operations might include "delete all values satisfying some condition" or "increment the values associated with all the keys in this list". Frequently, these operations can be performed in a single walk over the entire structure, taking O(n) time.

You seem to be concerned about the memory allocation involved in creating "1000's of new roots". Typical allocation for performing the operations one at a time would be O(k log n), where k is the number of nodes being modified. Typical allocation for performing the single walk over the entire structure would be O(n). Which is better depends on k and n.

In some cases, you can decrease the amount of allocation--at the cost of more complicated code--by paying special attention to when changes occur. For example, if you have a recursive algorithm that returns a tree, you might modify the algorithm to return a tree together with a boolean indicating whether anything has changed. The algorithm could then check those booleans before allocating a new node to see whether the old node can safely be reused. However, people don't usually bother with this extra check unless and until they have evidence that the extra memory allocation is actually a problem.

like image 174
Chris Okasaki Avatar answered Sep 18 '22 09:09

Chris Okasaki


A particular implementation of what you're looking for can be found in Clojure's (and ClojureScript's) transients.

In short, given a fully-immutable, persistent data structure, a transient version of it will make changes using destructive (allocation-efficient) mutation, which you can flip back into a proper persistent data structure again when you're done with your performance-sensitive operations. It is only at the transition back to a persistent data structure that new roots are created (for example), thus amortizing the attendant cost over the number of logical operations you performed on the structure while it was in its transient form.

like image 45
cemerick Avatar answered Sep 19 '22 09:09

cemerick