For processing large data structures in Clojure, lazy sequences offer a nice, idiomatic approach. One needs to be cautious to avoid head retention, though.
I struggle with processing a large tree structure like this:
R Root
__________|____________________
A B C, D, E, ... 1st Level Children
_______|_______ _______|_______
X Y Y ... Y X Y X Y Y ... Y X Y 2nd Level Children
:content
. The value of any :content
is a lazy seq with all the children of that node.Y
items on the 2nd level.Y
items fits into memory.After processing the tree, I would like to end up with a new tree, where all Y
nodes were removed:
R
______|__________________
A B C, D, E, ...
_____|___ _____|___
X X ... X X X ... X
;; Generating example data
;;;;;;;;;;;;;;;;;;;;;;;;;;
(defn root [content]
{:tag :root :content content})
(defn lazy-elements [n tag content]
(lazy-seq (repeat n {:tag tag :content content})))
(defn level-1 [content]
(lazy-elements 3 :A content))
(defn level-2 [n]
(concat (lazy-elements 10 :X '(:leaf))
(lazy-elements n :Y '(:leaf))))
(defn remove-nodes [node]
(remove #(= (:tag %) :Y) node))
;; Illustrating usage
;;;;;;;;;;;;;;;;;;;;;
;; runs and runs and runs... and eventually returns correctly
(defn valid-run []
(->> (root (level-1 (level-2 1e8)))
:content
first
:content
remove-nodes))
;; Does not terminate properly, runs out of memory
(defn invalid-run []
(->> (root (level-1 (level-2 1e8)))
:content
(map :content) ; source of head retention
(map remove-nodes)))
(Gist available on GitHub)
The second example will crash (depending on available memory, one might need to adjust the number of level 2 items). Mapping over the
:content
of all level 1 items introduces a reference which causes head retention issues when cycling through all the content items to
remove the unwanted :Y
items.
I was able to use data from something like valid-run
, put that into a var holding mutable state, doing that for all the relevant nodes
and after that stitching all the data together again. But I am very unhappy with that approach because of having to depend on mutability
and having to use some quite imperative code to merge the data in the end (running through indices of lists for example).
How can this be achieved in a functional, declarative style? Ideally I would want to avoid having to use mutable state as well as being too imperative (e.g. stitching together collections using indices and such).
The following articles and snippets are interesting reads about aspects of that problem:
Eventually I need this to process large XML files. Large means >1GB and parsing that into a tree will not work on the available memory. From that XML I want to put some elements into bucket A (let's say a database table) and all the rest of the XML tree into bucket B. The XML structure of course should be retained for the extracted subtrees.
Instead of parsing the XML into a tree, I could also process the XML as an event stream, for example via data.xml/source-seq. However, this would mean loosing the XML tree semantics. Would work, but not be beautiful. But maybe there are other approaches to handling that XML in the first place.
The problem is that your level-2
nodes all have pointers to the same in-memory lazy sequence, and then you map over that sequence multiple times. You would have the same problem if you just made valid-run
process both the first and second node - the number of nodes doesn't much matter, because you blow the heap with any two nodes. In a real application, where you've read these nodes from a database or a file or whatever, they will point to different objects, which you can handle lazily, each in turn.
If you generate more representative example data (ie, the same data but without the structural sharing), you can GC each node as you process it:
(defn root' [content]
(fn []
{:tag :root :content (content)}))
(defn lazy-elements' [n tag content]
(repeatedly n (fn [] {:tag tag :content (content)})))
(defn level-1' [content]
(fn []
(lazy-elements' 3 :A content)))
(defn level-2' [n]
(fn []
(concat (lazy-elements' 10 :X (fn [] '(:leaf)))
(lazy-elements' n :Y (fn [] '(:leaf))))))
(defn remove-nodes [node]
(remove #(= (:tag %) :Y) node))
(defn run []
(let [root-builder (root' (level-1' (level-2' 1e8)))]
(->> (root-builder)
:content
(map :content)
(map remove-nodes))))
user> (pprint (run))
(({:tag :X, :content (:leaf)}
{:tag :X, :content (:leaf)}
{:tag :X, :content (:leaf)}
{:tag :X, :content (:leaf)}
{:tag :X, :content (:leaf)}
{:tag :X, :content (:leaf)}
{:tag :X, :content (:leaf)}
{:tag :X, :content (:leaf)}
{:tag :X, :content (:leaf)}
{:tag :X, :content (:leaf)})
({:tag :X, :content (:leaf)}
{:tag :X, :content (:leaf)}
{:tag :X, :content (:leaf)}
{:tag :X, :content (:leaf)}
{:tag :X, :content (:leaf)}
{:tag :X, :content (:leaf)}
{:tag :X, :content (:leaf)}
{:tag :X, :content (:leaf)}
{:tag :X, :content (:leaf)}
{:tag :X, :content (:leaf)})
({:tag :X, :content (:leaf)}
{:tag :X, :content (:leaf)}
{:tag :X, :content (:leaf)}
{:tag :X, :content (:leaf)}
{:tag :X, :content (:leaf)}
{:tag :X, :content (:leaf)}
{:tag :X, :content (:leaf)}
{:tag :X, :content (:leaf)}
{:tag :X, :content (:leaf)}
{:tag :X, :content (:leaf)}))
Since we're just generating example content, I've adjusted all your node builders to take, rather than an object they should store N copies of, a function they should call N times to get N distinct objects. And rather than returning a node, they return a function that, when called, produces a copy of that node; this allows them to compose just as nicely as your original versions, just needing one extra function call at the outer level. If you actually have distinct objects already, as I suspect you would in a real application, you can just use your original functions as written.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With