Before starting to explain my problem, I should mention that I am not looking for a way to increase Java heap memory. I should strictly store these objects.
I am working on storing huge number (5-10 GB) of DNA sequences and their counts (Integer) in a hash table. The DNA sequences (with length 32 or less) consists of 'A', 'C', 'G', 'T', and 'N' (undefined) chars. As we know, when storing a large number of objects in memory, Java has poor space efficiency compared to lower level languages like C and C++. Thus, if I store this sequence as string (it holds about 100 MB memory for a sequence with length ~30), I see the error.
I tried to represent nucleic acids as 'A'=00, 'C'=01, 'G'=10, 'T'=11 and neglect 'N' (because it ruins the char to 2-bit transform as the 5-th acid). Then, concatenate these 2-bit acids into byte array. It brought some improvement but unfortunately I see the error after a couple of hours again. I need a convenient solution or at least a workaround to handle this error. Thank you in advance.
Being fairly complex maybe this here is a weird idea, and would require quite a lot of work, but this is what I would try:
You already pointed out two individual subproblems of your overall task:
HashMap
implementation may be suboptimal for such large collection sizesI would recommend to write a highly tailored hash map implementation for the Map<String, Long>
interface. Internally you do not have to store strings. Unfortunately 5^32 > 2^64, so there is no way to pack your whole string into a single long, well, let's stick to two longs for a key. You can make string to/back long[2] conversion fairly efficiently on the fly when providing a string key to your map implementation (use bit shifts etc).
As for packing the values, here are some considerations:
for a key-value pair a standard hashmap will need to have an array of N longs for buckets, where N is the current capacity, when the bucket is found from the hash key it will need to have a linked list of key-value pairs to resolve keys that produce identical hash codes. For your specific case you could try to optimize it in the following way:
3N
where N
is the capacity to store both keys and values in a continuous array3 * (hashcode % N)
and 3 * (hashcode % N) + 1
you store the long[2] representation of the key, of the first key that matches this bucket or of the only one (on insertion, zero otherwise), at location 3 * (hashcode % N) + 2
you store the corresponding countHashMap<Long2KeyWrapper, Long>
. The idea is to keep the capacity of the array mentioned above (and resize correspondingly) large enough to have by far the largest part of the data in that contiguous array and not in the fallback hash map. This will dramatically reduce the storage overhead of the hashmapGiven the inequality 5^32 > 2^64
your idea to use bits to encode 5 letters seems to be the best I can think of right now. Use 3 bits and correspondingly long[2].
I recommend you look into the Trove4j Collections API; it offers Collections that hold primitives which will use less memory than their boxed, wrapper classes.
Specifically, you should check out their TObjectIntHashMap
.
Also, I wouldn't recommended storing anything as a String
or char
until JDK 9 is released, as the backing char
array of a String
is UTF-16 encoded, using two byte
s per char
. JDK 9 defaults to UTF-8 where only one byte
is used.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With