Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Data structure equivalent to map ( in java ) for large datasets

Is there an already implemented data structure that I can use in order to assign to an object (in my case an Edge), an integer? I am reading a graph from a file , 10 mil vertices , 60 mil edges and I assign to each edge , a cost , using a map ( costs.put(e,cost) ).

I create the costs map in this way :

costs = new HashMap<Edge,Integer>();

The exception that it gives is :

java.lang.OutOfMemoryError: Java heap space
    at java.util.HashMap.resize(Unknown Source)
    at java.util.HashMap.addEntry(Unknown Source)
    at java.util.HashMap.put(Unknown Source) 
like image 804
Radu Stejerean Avatar asked Oct 31 '12 09:10

Radu Stejerean


2 Answers

HashMap is the correct data structure for a basic Map. The problem you are having is that the JVM is not being instructed to reserve enough space to keep the file contents in memory. Start the JVM with a -Xmx flag. For instance -Xmx1G parameter will allow it to use 1 Gigabyte of memory.

like image 196
Tim Bender Avatar answered Oct 15 '22 13:10

Tim Bender


You've got 6e7 edges. A plain Object takes 24 bytes (64-bit HotSpot), so that's 1.44e9 bytes right there (1.5 GB). Now you introduce the most efficient map imaginable, adding only 6e7 references plus 6e7 Integer objects. That's another 2.4e8 bytes for the refs and 1.44e9 bytes for the Integers: another 1.5 GB, the total being now 3 GB—and this is the theoretical lower bound for your problem (modulo caching, see below).

Based on this I suggest you just extend your Edge class with another int field. This will drastically reduce your memory footprint.

If this is not an option, however, and:

  • all your integers rarely exceed two digits,
  • you are careful never to use new Integer, but Integer.valueOf, autoboxing, etc.,
  • you use Java 7,

you'll automatically benefit from the built-in small integer cache. If the integers assume values from a wider range, but still with lot of duplication, a custom cache is highly advisable.

like image 30
Marko Topolnik Avatar answered Oct 15 '22 12:10

Marko Topolnik