In this SO thread, I learned that keeping a reference to a seq
on a large collection will prevent the entire collection from being garbage-collected.
First, that thread is from 2009. Is this still true in "modern" Clojure (v1.4.0 or v1.5.0)?
Second, does this issue also apply to lazy sequences? For example, would (def s (drop 999 (seq (range 1000))))
allow the garbage collector to retire the first 999
elements of the sequence?
Lastly, is there a good way around this issue for large collections? In other words, if I had a vector of, say, 10 million elements, could I consume the vector in such a way that the consumed parts could be garbage collected? What about if I had a hashmap with 10 million elements?
The reason I ask is that I'm operating on fairly large data sets, and I am having to be more careful not to retain references to objects, so that the objects I don't need can be garbage collected. As it is, I'm encountering a java.lang.OutOfMemoryError: GC overhead limit exceeded
error in some cases.
It is always the case that if you "hold onto the head" of a sequence then Clojure will be forced to keep everything in memory. It doesn't have a choice: you are still keeping a reference to it.
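A minimal sketch of the difference (the size here is arbitrary):

    ;; Holding the head: the var `s` keeps a reference to the first cell,
    ;; so every element `last` realizes stays reachable -> likely OOM.
    (def s (range 100000000))
    (last s)

    ;; Not holding the head: nothing references the early cells once
    ;; `last` has walked past them, so they can be collected as it goes.
    (last (range 100000000))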
However, "GC overhead limit exceeded" isn't the same as a plain out-of-memory error: it is more likely a sign that you are running a synthetic workload that creates and discards objects so fast that it tricks the GC into thinking it is overloaded.
If you put an actual workload on the items being processed, I suspect you will find that this error no longer occurs. In that case you can easily process lazy sequences that are larger than available memory.
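For instance, a pipeline like this does real work per element and runs in roughly constant memory, because no reference to the head survives (again, the size is arbitrary):

    ;; Sums 100 million squared values without ever holding more than
    ;; a chunk of the lazy sequence in memory at once.
    (reduce + (map #(* % %) (range 100000000)))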
Concrete collections like vectors and hash maps are a different matter, however: these are not lazy, so they must always be held completely in memory. If you have a dataset larger than memory, your options include streaming it lazily from external storage (a file, a database, a queue) or partitioning it into chunks and processing one chunk at a time, as in the sketch below.
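A minimal sketch of the streaming approach, assuming the records live one per line in a file; the file name "data.txt" and the per-line work are placeholders:

    (require '[clojure.java.io :as io])

    ;; Stream the records instead of loading a 10-million-element
    ;; collection: line-seq is lazy, and reduce consumes it without
    ;; holding the head, so processed lines become garbage immediately.
    (with-open [rdr (io/reader "data.txt")]
      (reduce (fn [total line] (+ total (count line)))
              0
              (line-seq rdr)))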