Removing elements from lazy sequences in a large Clojure tree structure, avoiding head retention

Question

Problem Description

For processing large data structures in Clojure, lazy sequences offer a nice, idiomatic approach. One needs to be cautious to avoid head retention, though.

I struggle with processing a large tree structure like this:

                 R                                         Root
       __________|____________________
       A                   B         C, D, E, ...          1st Level Children
_______|_______     _______|_______
X Y Y ... Y X Y     X Y Y ... Y X Y                        2nd Level Children

All nodes are maps with a key :content. The value of any :content is a lazy seq with all the children of that node.
The whole tree does not fit into memory. There are too many Y items on the 2nd level.
The whole tree excluding the Y items fits into memory.

After processing the tree, I would like to end up with a new tree, where all Y nodes were removed:

           R
     ______|__________________
     A             B         C, D, E, ...
_____|___     _____|___
X X ... X     X X ... X

Example code and further explanation

;; Generating example data
;;;;;;;;;;;;;;;;;;;;;;;;;;

(defn root [content]
  {:tag :root :content content})

(defn lazy-elements [n tag content]
  (lazy-seq (repeat n {:tag tag :content content})))

(defn level-1 [content]
  (lazy-elements 3 :A content))

(defn level-2 [n]
  (concat (lazy-elements 10 :X '(:leaf))
          (lazy-elements n :Y '(:leaf))))

(defn remove-nodes [node]
  (remove #(= (:tag %) :Y) node))


;; Illustrating usage
;;;;;;;;;;;;;;;;;;;;;

;; runs and runs and runs... and eventually returns correctly
(defn valid-run []
  (->> (root (level-1 (level-2 1e8)))
       :content
       first
       :content
       remove-nodes))

;; Does not terminate properly, runs out of memory
(defn invalid-run []
  (->> (root (level-1 (level-2 1e8)))
       :content
       (map :content)       ; source of head retention
       (map remove-nodes)))

(Gist available on GitHub)

The second example will crash (depending on available memory, one might need to adjust the number of level 2 items). Mapping over the :content of all level 1 items introduces a reference which causes head retention issues when cycling through all the content items to remove the unwanted :Y items.

I was able to use data from something like valid-run, put that into a var holding mutable state, doing that for all the relevant nodes and after that stitching all the data together again. But I am very unhappy with that approach because of having to depend on mutability and having to use some quite imperative code to merge the data in the end (running through indices of lists for example).

Question

How can this be achieved in a functional, declarative style? Ideally I would want to avoid having to use mutable state as well as being too imperative (e.g. stitching together collections using indices and such).

Resources

The following articles and snippets are interesting reads about aspects of that problem:

XML manipulation in Clojure
Rich Hickey on incidental complexity of head retention
Related Clojure documentation
Head retention funkiness in a nutshell

Some more background

Eventually I need this to process large XML files. Large means >1GB and parsing that into a tree will not work on the available memory. From that XML I want to put some elements into bucket A (let's say a database table) and all the rest of the XML tree into bucket B. The XML structure of course should be retained for the extracted subtrees.

Instead of parsing the XML into a tree, I could also process the XML as an event stream, for example via data.xml/source-seq. However, this would mean loosing the XML tree semantics. Would work, but not be beautiful. But maybe there are other approaches to handling that XML in the first place.

amalloy · Accepted Answer

The problem is that your level-2 nodes all have pointers to the same in-memory lazy sequence, and then you map over that sequence multiple times. You would have the same problem if you just made valid-run process both the first and second node - the number of nodes doesn't much matter, because you blow the heap with any two nodes. In a real application, where you've read these nodes from a database or a file or whatever, they will point to different objects, which you can handle lazily, each in turn.

If you generate more representative example data (ie, the same data but without the structural sharing), you can GC each node as you process it:

(defn root' [content]
  (fn []
    {:tag :root :content (content)}))

(defn lazy-elements' [n tag content]
  (repeatedly n (fn [] {:tag tag :content (content)})))

(defn level-1' [content]
  (fn []
    (lazy-elements' 3 :A content)))

(defn level-2' [n]
  (fn []
    (concat (lazy-elements' 10 :X (fn [] '(:leaf)))
            (lazy-elements' n :Y (fn [] '(:leaf))))))

(defn remove-nodes [node]
  (remove #(= (:tag %) :Y) node))

(defn run []
  (let [root-builder (root' (level-1' (level-2' 1e8)))]
    (->> (root-builder)
         :content
         (map :content)       
         (map remove-nodes))))

user> (pprint (run))
(({:tag :X, :content (:leaf)}
  {:tag :X, :content (:leaf)}
  {:tag :X, :content (:leaf)}
  {:tag :X, :content (:leaf)}
  {:tag :X, :content (:leaf)}
  {:tag :X, :content (:leaf)}
  {:tag :X, :content (:leaf)}
  {:tag :X, :content (:leaf)}
  {:tag :X, :content (:leaf)}
  {:tag :X, :content (:leaf)})
 ({:tag :X, :content (:leaf)}
  {:tag :X, :content (:leaf)}
  {:tag :X, :content (:leaf)}
  {:tag :X, :content (:leaf)}
  {:tag :X, :content (:leaf)}
  {:tag :X, :content (:leaf)}
  {:tag :X, :content (:leaf)}
  {:tag :X, :content (:leaf)}
  {:tag :X, :content (:leaf)}
  {:tag :X, :content (:leaf)})
 ({:tag :X, :content (:leaf)}
  {:tag :X, :content (:leaf)}
  {:tag :X, :content (:leaf)}
  {:tag :X, :content (:leaf)}
  {:tag :X, :content (:leaf)}
  {:tag :X, :content (:leaf)}
  {:tag :X, :content (:leaf)}
  {:tag :X, :content (:leaf)}
  {:tag :X, :content (:leaf)}
  {:tag :X, :content (:leaf)}))

Since we're just generating example content, I've adjusted all your node builders to take, rather than an object they should store N copies of, a function they should call N times to get N distinct objects. And rather than returning a node, they return a function that, when called, produces a copy of that node; this allows them to compose just as nicely as your original versions, just needing one extra function call at the outer level. If you actually have distinct objects already, as I suspect you would in a real application, you can just use your original functions as written.

Removing elements from lazy sequences in a large Clojure tree structure, avoiding head retention

Tags:

xml

tree

clojure

lazy-evaluation

Problem Description

Example code and further explanation

Question

Resources

Some more background

Nils Blum-Oeste

1 Answers

amalloy

Recent Activity

Donate For Us

Removing elements from lazy sequences in a large Clojure tree structure, avoiding head retention

Tags:

xml

tree

clojure

lazy-evaluation

Problem Description

Example code and further explanation

Question

Resources

Some more background

Nils Blum-Oeste

1 Answers

amalloy

Related questions

Recent Activity

Donate For Us