Clojure beginner/intermediate here,
I have a large XML file (~240 MB) that I need to process lazily, item by item, for ETL purposes. There is a run-processing function that does a lot of work with side effects: DB interactions, writing to logs, etc. When I apply said function to the file, everything runs smoothly:
...
(with-open [source (-> "in.xml"
                       io/file
                       io/input-stream)]
  (-> source
      xml/parse
      ((fn [x]
         ;; runs fine
         (run-processing conn config x)))))
But when I put the same function into any kind of loop (like doseq), I get an OutOfMemoryError (GC overhead limit exceeded).
...
(with-open [source (-> "in.xml"
                       io/file
                       io/input-stream)]
  (-> source
      xml/parse
      ((fn [x]
         ;; throws OOM "GC overhead limit exceeded"
         (doseq [i [0]]
           (run-processing conn config x))))))
I don't understand where the head retention that causes the GC overhead error is happening. I've already tried run! and even loop/recur instead of doseq; the same thing happens. Is there something wrong with my run-processing function? Then why does it behave fine when I run it directly? Kinda confused; any help is appreciated.
To understand why your doseq doesn't work, we first have to understand why (run-processing conn config x) works.

The magic of Clojure here is locals clearing: the compiler analyzes the code, and once a local binding has been used for the very last time, it is set to (Java) null before running that expression. So for

(fn [x]
  (run-processing conn config x))

the x will be cleared before run-processing runs. Note: you can get the same OOM error by disabling locals clearing (a compiler option).
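A minimal sketch of the difference (the function names are mine, purely illustrative): in consume-once the local can be cleared before the call, while in consume-in-loop it cannot:

```clojure
;; Illustrative only. `last` walks a seq without holding onto earlier
;; elements itself, so retention depends entirely on the caller.

(defn consume-once [xs]
  ;; xs is used exactly once: locals clearing nulls it out before the
  ;; call, so elements can be GC'd as `last` walks past them.
  (last xs))

(defn consume-in-loop [xs]
  ;; xs may be needed on the next iteration, so it cannot be cleared:
  ;; the head stays reachable and the realized seq accumulates.
  (doseq [_ [0 1]]
    (last xs)))

;; With a small heap (e.g. -Xmx64m), the first call below would
;; succeed while the second risks the same OOM:
;; (consume-once    (repeat 100000000 :x))
;; (consume-in-loop (repeat 100000000 :x))
```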
Now what happens when you write:

(doseq [_ [0]]
  (run-processing conn config x))

How should the compiler know when x is used for the very last time so it can clear it? It can't possibly know: x is used within a loop. So it's never cleared, and x retains the head.
Note: A smart JVM implementation could possibly change this in the future when it understands that the local memory location can't be accessed by the calling function anymore and offer the binding to the garbage collector. Though, current implementations aren't that smart.
Of course it's easy to fix: don't use x within a loop. Use other constructs like run!, which is just a function call, so the local is properly cleared before run! is invoked. Though if you pass the head of the seq to a function, that function will hold onto the head until the function (closure) goes out of scope.
While I don't know exactly what's causing OOM, I'd like to provide some general suggestions and elaborate on our discussion in the comments.
So the sequence will be kept in memory when I use any sort of loop, but will not if I call run-processing directly? But in doseq docstring it's clearly stated that "Does not retain the head of the sequence". Then what should I do when I need to call run-processing several times (e.g. with different arguments)?
So here's our function:
(defn process-file! [conn config name]
  (with-open [source (io/input-stream (io/file name))]
    (-> (xml/parse source)
        ((fn [x]
           (doseq [i [0]]
             (run-processing conn config x)))))))
where x is a lazy seq (if you're using data.xml), like:

x <- xml iterator <- file stream

If run-processing is doing everything right (fully consumes x and returns nil), there's nothing wrong with it; the problem is the x binding itself. While run-processing runs, it fully realizes the sequence that x is the head of.
(defn process-xml! [conn config x]
  (run-processing conn config x)
  ;; X IS FULLY REALIZED IN MEMORY
  (run-reporting conn config x))

(defn process-file! [conn config name]
  (with-open [source (io/input-stream (io/file name))]
    (->> (xml/parse source)
         (process-xml! conn config))))
As you can see, we're not consuming the file item by item and immediately throwing the items away, all thanks to x. doseq has nothing to do with this: it "does not retain the head of the sequence" it consumes, and that sequence is [0] in our case.
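As a quick sanity check of that docstring claim: doseq really does consume the seq it iterates in constant memory; below it walks a million-element range without realizing it all at once. It's only a named binding like x outside the loop that pins the head.

```clojure
;; doseq does not retain the head of the seq it iterates, so this
;; runs in constant memory no matter how long the range is.
(doseq [i (range 1000000)]
  (when (zero? (mod i 250000))
    (println i)))
```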
This approach is not very idiomatic, for two reasons:

1. run-processing is doing too much

It knows where the data is coming from, in what shape, how to process it, and what to do with the results. A more typical process-file! would look like this:
(defn process-file! [conn config name]
  (with-open [source (io/input-stream (io/file name))]
    (->> (xml/parse source)
         (find-item-nodes)
         (map node->item)
         (run! (partial process-item! conn config)))))
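find-item-nodes and node->item above are placeholders. One hypothetical shape for them, assuming data.xml-style element maps (:tag, :attrs, :content) and item elements sitting directly under the root:

```clojure
;; Hypothetical helpers for the pipeline above; your element names and
;; shapes will differ. Both are lazy, so nothing forces the whole file.

(defn find-item-nodes
  "Lazy seq of :item elements directly under the parsed root."
  [root]
  (filter #(= :item (:tag %)) (:content root)))

(defn node->item
  "Turns an :item element into a plain map."
  [node]
  {:id   (get-in node [:attrs :id])
   :text (apply str (:content node))})
```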
This is not always viable and doesn't fit every use case, but there's one more reason to do it this way.
2. process-file! should (ideally) never give x to anyone

This one is immediately obvious from looking at your original code: it's using with-open. query from clojure.java.jdbc is a good example. It gets a ResultSet, maps it to pure Clojure data structures, and forces them to be fully read (with result-set-fn or doall) so the connection can be freed.
Notice how it never leaks the ResultSet, and the only way to get at the results is through the result seq (via result-set-fn), which acts as a "callback": query wants to control the ResultSet lifecycle and make sure it's closed once query returns. Otherwise it would be too easy to make a mistake similar to yours. (Though we still can, if we pass it a function similar to process-xml! as result-set-fn.)
As I've said, I can't tell you exactly what's causing the OOM. It could be that:

- run-processing itself is the culprit: the JVM is low on memory anyway, and adding a simple doseq tips it over into OOM. That's why I suggested slightly increasing the heap size as a test.
- Clojure optimizes the x binding away: (fn [x] (run-processing conn config x)) is simply inlined by the JVM, incidentally fixing the issue with the x binding.
So why does wrapping run-processing in doseq make x retain the head? In my examples I don't use x more than once (contrary to your "run-processing x THEN run-reporting on SAME x").
The root of the problem is not the fact of reusing x; it's the sole fact of x existing. Let's make a simple lazy seq:
(let [x (range 1 1e6)])
(Let's forget for a moment that range is actually implemented as a Java class.)

What is x? x is a lazy seq head, which is a recipe for constructing the next value:

x = (recipe)
Let's advance it:
(let [x (range 1 1e6)
      y (drop 5 x)
      z (first y)])
Here are x, y and z now:

x = (1) -> (2) -> (3) -> (4) -> (5) -> (6) -> (recipe)
y = (6) -> (recipe)
z = 6

Hopefully you can now see what I mean by saying "x is a seq head, and run-processing realizes it".
About "process-file! should (ideally) never give x to anyone": correct me if I'm wrong, but doesn't mapping to pure Clojure data structures with doall make them reside in memory, which would be bad if the file is too big (as in my case)?
process-file! doesn't use doall. run! is a reduce over the seq for side effects and returns nil, so no results accumulate.
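For completeness, run! applies a function to each element purely for side effects via reduce and returns nil, so there is no accumulated result to keep in memory:

```clojure
;; run! reduces over the seq for side effects only and returns nil;
;; it never builds a result collection.
(run! println (range 3))
;; prints 0, 1 and 2 on separate lines and returns nil
```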