
HashMap in Java, 100 Million entries

Tags: java, hashmap

I want to store 100 million terms and their frequencies (from a text database) in a HashMap<String, Double>, but it gives me an "Out of Memory" error. I tried increasing the heap space with -Xmx15000M, but it runs for half an hour and then throws the same exception. The file from which I'm reading the words and frequencies is 1.7 GB.
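For reference, a simplified sketch of the loading loop (the file name and tab-separated format here are placeholders, not my exact code):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class FrequencyLoader {
    public static void main(String[] args) throws IOException {
        // Each entry costs far more than the raw text: the HashMap.Entry,
        // the String object with its backing char array, and the boxed
        // Double add on the order of 100 bytes per term on a 64-bit JVM.
        Map<String, Double> frequencies = new HashMap<>();
        try (BufferedReader reader = new BufferedReader(new FileReader("terms.txt"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split("\t");   // assumed tab-separated
                frequencies.put(parts[0], Double.parseDouble(parts[1]));
            }
        }
        System.out.println("Loaded " + frequencies.size() + " terms");
    }
}
```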

Any help would be much appreciated.

Thanks :-)

asked Nov 02 '10 by ablimit

People also ask

How many entries can a HashMap hold?

A HashMap in Java can have a maximum of 2^30 buckets for storing entries. This is because the bucket-assignment technique used by java.util.HashMap requires the number of buckets to be a power of 2, and since ints are signed in Java, the maximum positive value is 2^31 - 1, so the largest power of 2 that fits is 2^30.

What is the maximum limit of HashMap in Java?

In Sun's JVM, HashMap uses an array whose size is a power of 2. The largest power of two allowed for an array size is 2^30, and the largest number of elements you can hold before the HashMap tries to double its size to 2^31 (which it cannot do) is 2^30 * loadFactor, or roughly 800 million for the default load factor of 0.75.

What is the size of HashMap in Java?

Capacity is the number of buckets in the HashMap, and the default initial capacity is 16. As the number of elements in the HashMap increases, the capacity is expanded.

Why is the initial capacity of HashMap 16?

The initial capacity of the HashMap is the number of buckets in the hash table, fixed when the HashMap object is created. The initial capacity is 2^4, i.e., 16, and the capacity is doubled each time the number of entries reaches the threshold.
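If the expected element count is known up front, the map can be pre-sized so it never needs to double; a small illustrative sketch:

```java
import java.util.HashMap;
import java.util.Map;

public class PresizedMap {
    public static void main(String[] args) {
        // Default: 16 buckets, load factor 0.75, so the bucket array
        // doubles as soon as more than 12 entries are added.
        Map<String, Double> defaults = new HashMap<>();

        // Pre-sizing avoids all intermediate rehashes: pick a capacity
        // large enough that expected / loadFactor never crosses the threshold.
        int expected = 1_000_000;
        Map<String, Double> presized = new HashMap<>((int) (expected / 0.75f) + 1);
    }
}
```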


2 Answers

For word processing like that, the answer is usually a trie rather than a HashMap, if you can live with the longer lookup times. That structure is quite memory-efficient for natural languages, in which many words share common prefixes.
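As a rough illustration, a minimal (uncompressed) trie mapping terms to frequencies might look like the sketch below; words with a common prefix share nodes instead of each keeping a full String key:

```java
import java.util.HashMap;
import java.util.Map;

// Minimal frequency trie: one node per character. A production version
// would use arrays or a Patricia/radix layout to cut per-node overhead.
class TrieNode {
    Map<Character, TrieNode> children = new HashMap<>();
    double frequency;           // 0.0 means "no entry ends here"
}

class FrequencyTrie {
    private final TrieNode root = new TrieNode();

    void put(String word, double frequency) {
        TrieNode node = root;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new TrieNode());
        }
        node.frequency = frequency;
    }

    Double get(String word) {
        TrieNode node = root;
        for (char c : word.toCharArray()) {
            node = node.children.get(c);
            if (node == null) return null;
        }
        return node.frequency == 0.0 ? null : node.frequency;
    }
}
```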

Depending on the input, a Patricia tree might be even better.

(Also, if these really are words from a natural language, are you sure you need 100,000,000 entries? The number of commonly used words is surprisingly small; commercial solutions (word prediction, spelling correction) rarely use more than 100,000 words, regardless of language.)

answered Oct 02 '22 by Christoffer


Your problem is that 1.7 GB of raw text is more than 1,500 MB even before the overhead added by the individual String objects. For huge mappings you should use either a database or a file-backed map; these use disk space instead of heap memory.
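For example, a file-backed map with the MapDB library (just one option; any disk-backed map or embedded database would do) keeps the entries on disk rather than on the heap:

```java
import org.mapdb.DB;
import org.mapdb.DBMaker;
import org.mapdb.Serializer;

import java.util.concurrent.ConcurrentMap;

public class DiskBackedFrequencies {
    public static void main(String[] args) {
        // Entries live in terms.db on disk; only a cache stays on the heap.
        DB db = DBMaker.fileDB("terms.db").make();
        ConcurrentMap<String, Double> frequencies = db
                .hashMap("frequencies", Serializer.STRING, Serializer.DOUBLE)
                .createOrOpen();

        frequencies.put("example", 0.0042);
        System.out.println(frequencies.get("example"));

        db.close();
    }
}
```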

Update

Allocating 15 GB for the heap is not possible with any 32-bit JVM, so you need a 64-bit JVM. On a 64-bit JVM, 15 GB of heap should work as long as enough RAM is available.
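You can check how much heap the JVM actually granted with Runtime.getRuntime().maxMemory():

```java
public class HeapCheck {
    public static void main(String[] args) {
        // Prints the maximum heap the JVM will attempt to use (-Xmx),
        // e.g. run with: java -Xmx15000M HeapCheck
        long maxBytes = Runtime.getRuntime().maxMemory();
        System.out.println("Max heap: " + (maxBytes >> 20) + " MB");
    }
}
```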

answered Oct 02 '22 by josefx