Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

ArrayList<Double> to double[] with 300 million entries

I'm using a java program to get some data from a DB. I then calculate some numbers and start storing them in an array. The machine I'm using has 4 gigs of RAM. Now, I don't know how many numbers there will be in advance, so I use an ArrayList<Double>. But I do know there will be roughly 300 million numbers.

So, since one double is 8 bytes a rough estimate of the memory this array will consume is 2.4 gigs (probably more because of the overheads of an ArrayList). After this, I want to calculate the median of this array and am using the org.apache.commons.math3.stat.descriptive.rank.Median library which takes as input a double[] array. So, I need to convert the ArrayList<Double> to double[].

I did see many questions where this is raised and they all mention there is no way around looping through the entire array. Now this is fine, but since they also maintain both objects in memory, this brings my memory requirements up to 4.8 gigs. Now we have a problem since the total RAM available us 4 gigs.

First of all, is my suspicion that the program will at some point give me a memory error correct (it is currently running)? And if so, how can I calculate the median without having to allocate double the memory? I want to avoid sorting the array as calculating the median is O(n).

like image 996
Rohit Pandey Avatar asked Dec 25 '22 18:12

Rohit Pandey


1 Answers

Your problem is even worse than you realize, because ArrayList<Double> is much less efficient than 8 bytes per entry. Each entry is actually an object, to which the ArrayList keeps an array of references. A Double object is probably about 12 bytes (4 bytes for some kind of type identifier, 8 bytes for the double itself), and the reference to it adds another 4, bringing the total up to 16 bytes per entry, even excluding overhead for memory management and such.

If the constraints were a little wider, you could implement your own DoubleArray that is backed by a double[] but knows how to resize itself. However, the resizing means you'll have to keep a copy of both the old and the new array in memory at the same time, also blowing your memory limit.

That still leaves a few options though:

  • Loop through the input twice; once to count the entries, once to read them into a right-sized double[]. It depends on the nature of your input whether this is possible, of course.

  • Make some assumption on the maximum input size (perhaps user-configurable), and allocate a double[] up front that is this fixed size. Use only the part of it that's filled.

  • Use float instead of double to cut memory requirements in half, at the expense of some precision.

  • Rethink your algorithm to avoid holding everything in memory at once.

like image 131
Thomas Avatar answered Dec 28 '22 11:12

Thomas