I'm following the algo-class.org course, and one of its programming assignment provide a file with format as below:
1 2
1 5
2 535
...
There are over 5 million such lines, I want to read in the file and convert it to a vector of integer vector like this: [[1 2][1 5][2 535]...].
(defn to-int-vector [s]
(vec (map #(Integer/parseInt %) (re-seq #"\w+" s))))
(def ints (with-open [rdr (clojure.java.io/reader "<file>")]
(doall (map to-int-vector (line-seq rdr)))))
So i believe in this way, I'm not holding the entire file in memory, and only generating a large integer vector. But i get OutOfMemoryError from this. I did try to generate a vector of the same size and same format by running rand-int, and that works fine.
Looks like the memory issue is caused by the temp objects generated? What is the ideal way in clojure to handle a case like this?
Update:
yes, i realize that I am hold the entire integer vector. I have raised the heap size and it now works. I'm interested that a vector and 5 million elements(10 million intergers) can take up so much memory -- I have to allocate 3g for the jvm. Is there any other way that will take the memory down?
You wouldn't believe how much overhead a realized lazy seq imposes. I tested this on a 64-bit OS: it's something like 120 bytes. That's pure overhead for every lazy seq member. The vector, on the other hand, has quite a low overhead and is basically the same as a Java array, given a large enough vector. So try replacing doall
with vec
.
Let's also see how much memory you are spending without the overhead. You've got 5e6 pairs of integers -- that's 5e6 x 8 = 40 MB. You could save by using shorts and get a 50% saving (I repeat---that's not counting the overhead of the parent collection, and each vector instance holding the pair has its own overhead).
The next step in the saving is to use a raw array for both the outer collection and for the pairs. It could still be a very practical solution since an array is seqable and integrates quite well with the language. To do that, you'd just have to replace the two occurrences of vec
with to-array
.
UPDATE
The difference between Integer
and Short
is not that big due to both still being full-fledged objects. It would save much more to store the number pairs as primitive arrays, using short-array
(or int-array
) instead of to-array
.
It's hard to utilize laziness and to encapsulate with-open
at the same time. Laziness is important in your case, because it makes it possible to have only the "relevant" part of the line or int-vector sequence in memory.
One solution to the problem is don't encapsulate with-open
and to contain the whole line-processing logic inside the dynamic scope of a with-open
form:
(with-open [rdr (clojure.java.io/reader "<file>")]
(doseq [int-vector (map to-int-vector (line-seq rdr))]
(process int-vector)))
Two important details here are that you don't store away the line and int-vector sequence, and that you only only use them within the with-open
form. These details ensure that the parts of the sequence that have already been processed can be garbage collected, and that the file stream remains open during the whole processing.,
the def
in (def ints
will ensure that the entire result is stored in memory
which will be at least as large as the file because the numbers on each line are
stored in a collection, a vec in this case, that takes up space also.
also by default java will refuse to use all the memory in the computer, you may need to set the maxHeapSize parameter.
If you start with a new repl (not already holding any large lists) do you still run out of memory?
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With