 

When should I avoid using `seq` in Clojure?

In this SO thread, I learned that keeping a reference to a seq on a large collection will prevent the entire collection from being garbage-collected.

First, that thread is from 2009. Is this still true in "modern" Clojure (v1.4.0 or v1.5.0)?

Second, does this issue also apply to lazy sequences? For example, would `(def s (drop 999 (seq (range 1000))))` allow the garbage collector to retire the first 999 elements of the sequence?

Lastly, is there a good way around this issue for large collections? In other words, if I had a vector of, say, 10 million elements, could I consume the vector in such a way that the consumed parts could be garbage collected? What about if I had a hashmap with 10 million elements?

The reason I ask is that I'm operating on fairly large data sets, and I am having to be more careful not to retain references to objects, so that the objects I don't need can be garbage collected. As it is, I'm encountering a `java.lang.OutOfMemoryError: GC overhead limit exceeded` error in some cases.

asked Feb 21 '13 by Jeff Terrell Ph.D.


1 Answer

It is always the case that if you "hold onto the head" of a sequence then Clojure will be forced to keep everything in memory. It doesn't have a choice: you are still keeping a reference to it.
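For illustration, here is a minimal sketch of the difference; the var name `held` is mine, not from the original answer. Binding the head of the sequence to a var keeps every realized element reachable, while passing the sequence straight into a reduction lets already-consumed elements be collected.

```clojure
;; Holding the head: the var keeps a reference to the first cell, so every
;; element realized while summing stays reachable and the heap fills up.
(def held (map inc (range 100000000)))
(reduce + held)

;; Not holding the head: the sequence is consumed as it is produced, so
;; already-processed elements can be garbage collected along the way.
(reduce + (map inc (range 100000000)))
```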

However, the "GC overhead limit exceeded" error isn't the same as a plain out-of-memory error: it's more likely a sign that you are running a fictitious workload that creates and discards objects so fast that it tricks the GC into thinking it is overloaded.

See:

  • GC overhead limit exceeded

If you put an actual workload on the items being processed, I suspect you will see that this error won't happen any more. You can easily process lazy sequences that are larger than available memory in this case.
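As a rough sketch of what "an actual workload" might look like (the `expensive-step` function below is a hypothetical stand-in), a reduction that does some non-trivial work per element and never binds the sequence to a name can walk far more elements than would fit in the heap:

```clojure
;; stand-in for real per-element work
(defn expensive-step [x]
  (hash (str "record-" x)))

;; nothing retains the head of the sequence, so elements can be collected
;; as soon as the reduction has moved past them
(reduce (fn [acc x] (unchecked-add acc (expensive-step x)))
        0
        (range 1000000000))
```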

Concrete collections like vectors and hashmaps are a different matter, however: these are not lazy, so they must always be held completely in memory. If you have datasets larger than memory then your options include:

  • Use lazy sequences and don't hold onto the head
  • Use specialised collections that support lazy loading (Datomic uses some structures like this I believe)
  • Treat the data as an event stream (using something like Storm)
  • Write custom code to partition the data into chunks and process them one at a time (a sketch follows this list).
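A hedged sketch of that last option, assuming the data lives in a large text file; `process-batch!` and the file path are hypothetical placeholders for whatever per-chunk work you actually need:

```clojure
(require '[clojure.java.io :as io])

(defn process-batch! [batch]
  ;; placeholder: replace with real per-chunk work (aggregate, write out, ...)
  (println "processed" (count batch) "lines"))

(defn process-file-in-chunks [path chunk-size]
  (with-open [rdr (io/reader path)]
    ;; partition-all is lazy and doseq does not retain the head,
    ;; so only one chunk needs to fit in memory at a time
    (doseq [batch (partition-all chunk-size (line-seq rdr))]
      (process-batch! batch))))

;; (process-file-in-chunks "big-data.txt" 10000)
```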
answered Oct 01 '22 by mikera