 

OutOfMemory error when processing a big file in Clojure

I'm following the algo-class.org course, and one of its programming assignments provides a file in the format below:

1 2
1 5
2 535

...

There are over 5 million such lines. I want to read the file and convert it to a vector of integer vectors, like this: [[1 2] [1 5] [2 535] ...].

(defn to-int-vector [s]
    (vec (map #(Integer/parseInt %) (re-seq #"\w+" s))))    

(def ints (with-open [rdr (clojure.java.io/reader "<file>")]
               (doall (map to-int-vector (line-seq rdr)))))

So I believe that this way I'm not holding the entire file in memory, only generating a large vector of integers. But I get an OutOfMemoryError from this. I did try generating a vector of the same size and format with rand-int, and that works fine.

It looks like the memory issue is caused by the temporary objects generated. What is the ideal way in Clojure to handle a case like this?

Update:

Yes, I realize that I am holding the entire integer vector. I have raised the heap size and it now works. I'm surprised that a vector of 5 million elements (10 million integers) can take up so much memory -- I had to allocate 3 GB for the JVM. Is there any other way to bring the memory usage down?

asked Apr 11 '12 by awh

3 Answers

You wouldn't believe how much overhead a realized lazy seq imposes. I tested this on a 64-bit OS: it's something like 120 bytes per element. That's pure overhead for every lazy-seq cell. A vector, on the other hand, has quite low overhead, basically the same as a Java array given a large enough vector. So try replacing doall with vec.
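For example (a minimal sketch, with a small in-memory list of lines standing in for the real file and `line-seq`):

```clojure
;; Building the result with vec realizes it as a vector, whose
;; per-element overhead is close to a plain Java array, instead of
;; keeping ~120 bytes of lazy-seq cell overhead per element.
(defn to-int-vector [s]
  (vec (map #(Integer/parseInt %) (re-seq #"\w+" s))))

(def lines ["1 2" "1 5" "2 535"])   ; stands in for (line-seq rdr)

(def ints (vec (map to-int-vector lines)))
;; ints => [[1 2] [1 5] [2 535]]
```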

Let's also see how much memory you are spending without the overhead. You've got 5e6 pairs of integers -- that's 5e6 x 8 = 40 MB. You could halve that by using shorts (again, that's not counting the overhead of the parent collection, and each vector instance holding a pair has its own overhead).

The next step in the saving is to use a raw array for both the outer collection and the pairs. This can still be quite practical, since an array is seqable and integrates well with the language. To do that, just replace the two occurrences of vec with to-array.
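A sketch of that change, again with sample data in place of the file:

```clojure
;; Arrays of arrays instead of vectors of vectors: to-array replaces
;; vec at both levels. Arrays are still seqable, so most sequence
;; functions keep working on the result.
(defn to-int-array [s]
  (to-array (map #(Integer/parseInt %) (re-seq #"\w+" s))))

(def ints (to-array (map to-int-array ["1 2" "1 5" "2 535"])))

(seq (aget ^objects ints 0))  ; => (1 2)
```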

UPDATE

The difference between Integer and Short is not that big, since both are still full-fledged objects. It would save much more to store the number pairs as primitive arrays, using short-array (or int-array) instead of to-array.
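A sketch of the primitive-array variant: each pair becomes an int[] costing 2 x 4 bytes plus a small array header, instead of two boxed Integer objects.

```clojure
;; The inner pairs are primitive int arrays; the outer collection
;; remains an object array holding them.
(defn to-primitive-ints [s]
  (int-array (map #(Integer/parseInt %) (re-seq #"\w+" s))))

(def ints (to-array (map to-primitive-ints ["1 2" "1 5" "2 535"])))

(aget ^ints (aget ^objects ints 1) 0)  ; => 1
```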

answered Nov 04 '22 by Marko Topolnik


It's hard to utilize laziness and to encapsulate with-open at the same time. Laziness is important in your case, because it makes it possible to hold only the "relevant" part of the line or int-vector sequence in memory.

One solution to the problem is to not encapsulate the with-open, and instead keep the whole line-processing logic inside the dynamic scope of the with-open form:

(with-open [rdr (clojure.java.io/reader "<file>")]
  (doseq [int-vector (map to-int-vector (line-seq rdr))]
    (process int-vector)))

Two important details here are that you don't store the line and int-vector sequences anywhere, and that you only use them within the with-open form. These details ensure that the parts of the sequences that have already been processed can be garbage collected, and that the file stream remains open during the whole processing.
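If you need a result rather than a side effect, the same pattern works with an eager reduce inside the with-open form, so only a small part of the sequence is live at any moment. A sketch, with a StringReader standing in for the real file and a sum standing in for whatever `process` would compute:

```clojure
(require '[clojure.java.io :as io])

(defn to-int-vector [s]
  (vec (map #(Integer/parseInt %) (re-seq #"\w+" s))))

;; reduce consumes the lazy seq eagerly inside the with-open scope:
;; processed portions become garbage-collectable as it goes, and the
;; reader stays open until the reduction finishes.
(def total
  (with-open [rdr (io/reader (java.io.StringReader. "1 2\n1 5\n2 535\n"))]
    (reduce + (mapcat to-int-vector (line-seq rdr)))))
;; total => 546
```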

answered Nov 04 '22 by raek


The def in (def ints ...) will ensure that the entire result is stored in memory, which will be at least as large as the file, because the numbers on each line are stored in a collection (a vec in this case) that takes up space as well.

Also, by default the JVM will not use all the memory on the machine; you may need to raise the maximum heap size (the -Xmx flag).
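For example, to allow a 3 GB heap when launching Clojure from the command line (the jar path and script name here are hypothetical placeholders):

```shell
java -Xmx3g -cp clojure.jar clojure.main my_script.clj
```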

If you start with a fresh REPL (one not already holding any large collections), do you still run out of memory?

answered Nov 04 '22 by Arthur Ulfeldt