 

How to handle large data sets in Java without using too much memory

Tags:

java

I'm working in Java. I need to compare the results of two database queries. To do this, I take each row of the result set and store it in a Hashtable, with the field name as the key and the data in the field as the value. I then collect the Hashtables for the entire result set into a single Vector, which serves only as a container. So to compare two queries, I'm really iterating through two Vectors of Hashtables.

I've found that this approach works really well for me but requires a lot of memory. Because of other design requirements, I have to do this comparison via a Vector-of-Hashtables-like structure, and not some DB-side procedure.

Does anyone have any suggestions for optimization? The optimal solution would be one that is somewhat similar to what I am doing now as most of the code is already designed around it.

Thanks

Tyler asked Aug 24 '10 20:08




1 Answer

Specify the same ORDER BY clause (based on the "key") for both result sets. Then you only have to have one record from each result set in memory at once.

For example, say your results are res1 and res2.

If the key field of res1 is less than the key field of res2, res2 is missing some records; iterate res1 until its key field is equal to or greater than the key of res2.

Likewise, if the key field of res1 is greater than the key field of res2, res1 is missing some records; iterate res2 instead.

If the key fields of the current records are equal, you can compare their values, then iterate both result sets.

You can see, in this manner, that only one record from each result is required to be held in memory at a given time.
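
The steps above can be sketched roughly as follows. This is only an illustration under some assumptions: each row is reduced to an integer key plus a single string value, and a plain Iterator stands in for a JDBC ResultSet so the example runs without a database. The class name SortedResultDiff and the method diff are made up for this sketch. Both inputs must already be sorted by key, as the ORDER BY advice requires.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Objects;

// Hypothetical helper, not part of any library.
final class SortedResultDiff {

    /**
     * Walks two key-sorted row streams in lockstep and reports the differences.
     * Only the current row of each stream is held at any time.
     */
    static List<String> diff(Iterator<Map.Entry<Integer, String>> res1,
                             Iterator<Map.Entry<Integer, String>> res2) {
        List<String> report = new ArrayList<>();
        Map.Entry<Integer, String> a = advance(res1);
        Map.Entry<Integer, String> b = advance(res2);

        while (a != null || b != null) {
            int cmp;
            if (a == null) {
                cmp = 1;   // res1 is exhausted; the rest of res2 is unmatched
            } else if (b == null) {
                cmp = -1;  // res2 is exhausted; the rest of res1 is unmatched
            } else {
                cmp = a.getKey().compareTo(b.getKey());
            }

            if (cmp < 0) {
                // res1's key is smaller, so res2 has no row with this key
                report.add("only in res1: " + a.getKey());
                a = advance(res1);
            } else if (cmp > 0) {
                // res2's key is smaller, so res1 has no row with this key
                report.add("only in res2: " + b.getKey());
                b = advance(res2);
            } else {
                // Same key on both sides: compare the values, then move both
                if (!Objects.equals(a.getValue(), b.getValue())) {
                    report.add("value differs: " + a.getKey());
                }
                a = advance(res1);
                b = advance(res2);
            }
        }
        return report;
    }

    // Returns the next row, or null once the stream is exhausted.
    private static Map.Entry<Integer, String> advance(Iterator<Map.Entry<Integer, String>> it) {
        return it.hasNext() ? it.next() : null;
    }

    public static void main(String[] args) {
        List<Map.Entry<Integer, String>> r1 =
                List.of(Map.entry(1, "a"), Map.entry(2, "b"), Map.entry(4, "d"));
        List<Map.Entry<Integer, String>> r2 =
                List.of(Map.entry(1, "a"), Map.entry(2, "x"), Map.entry(3, "c"));
        System.out.println(diff(r1.iterator(), r2.iterator()));
    }
}
```

With real queries, the two iterators would be replaced by two ResultSet cursors, and the per-row Hashtable could still be built for just the current row on each side, so the existing comparison code might be reusable. Note that the report list here grows with the number of differences; if that could be large, write each difference out as it is found instead of collecting it.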

erickson answered Oct 16 '22 12:10