I have a rather large dataset consisting of 2.3GB worth of data spread over 160 Million byte[] arrays with an average data length of 15 bytes. The value for each byte[] key is only an int so memory usage of nearly half the hashmap (which is over 6GB) is made up of the 16 byte overhead of each byte array
overhead = 8 byte header + 4 byte length rounded up by VM to 16 bytes.
So my overhead is 2.5GB.
Does anybody know of a hashmap implementation that stores its (variable length) byte[] keys in one single large byte array so there would be no overhead (apart from 1 byte length field)?
I would rather not use an in memory DB as they usually have a performance overhead compared to a plain Trove TObjectIntHashMap that i'm using and I value cpu cycles even more then memory usage.
Thanks in advance
Since most personal computers have 16GB these days and servers often 32 - 128 GB or more, is there a real problem with there being a degree of bookkeeping overhead?
If we consider the alternative: the byte data concatenated into a single large array -- we should think about what individual values would have to look like, to reference a slice of a larger array.
Most generally you would start with:
public class ByteSlice {
protected byte[] array;
protected int offset;
protected int len;
}
However, that's 8 bytes + the size of a pointer (perhaps just 4 bytes?) + the JVM object header (12 bytes on a 64-bit JVM). So perhaps total 24 bytes.
If we try and make this single-purpose & minimalist, we're still going to need 4 bytes for the offset.
public class DedicatedByteSlice {
protected int offset;
protected byte len;
protected static byte[] getArray() {/*somebody else knows about the array*/}
}
This is still 5 bytes (probably padded to 8) + the JVM object header. Probably still total 20 bytes.
It seems that the cost of dereferencing with an offset & length, and having an object to track that, are not substantially less than the cost of storing the small array directly.
One further theoretical possibility -- de-structuring the Map Key so it is not an object
It is possible to conceive of destructuring the "length & offset" data such that it is no longer in an object. It then is passed as a set of scalar parameters eg (length, offset) and -- in a hashmap implementation -- would be stored by means of arrays of the separate components (eg. instead of a single Object[] keyArray).
However I expect it is extremely unlikely that any library offers an existing hashmap implementation for your (very particular) usecase.
If you were talking about the values it would probably be pointless, since Java does not offer multiple returns or method OUT parameters; which makes communication impractical without "boxing" destructured data back to an object. Since here you are asking about Map Keys specifically, and these are passed as parameters but need not be returned, such an approach may be theoretically possible to consider.
[Extended] Even given this it becomes tricky -- for your usecase the map API probably has to become asymmetrical for population vs lookup, as population has to be by (offset, len) to define keys; whereas practical lookup is likely still by concrete byte[] arrays.
OTOH: Even quite old laptops have 16GB now. And your time to write this (times 4-10 to maintain) should be worth far more than the small cost of extra RAM.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With