I'd like to design a JVM data structure (Java/Scala) that can be used to represent and store the contents of arbitrary relational database tables. The data structure should be fast (not too gc-intensive, cache-friendly) and memory efficient, so larger tables can fit in RAM.
One memory-efficient solution is to store each column separately in a primitive array, but I'm worried about the cache friendliness because items in the same row are not stored together. A row with N columns will incur N cache misses, no matter how narrow the columns.
Another solution is to store each row in an object array where each element represents a field and is cast to the correct type on retrieval, but this requires storing numeric types in their boxed form, so it's not very memory-efficient. And it's probably not that cache efficient either.
Another solution is to layout each row's data into a byte array the same way real databases serialize their rows, using only as many bytes as necessary. This is cache-friendly and memory efficient, but I'm concerned about the cost of serialization/de-serialization on every access.
What's the best way?
A fourth solution would be to store each row's data as strings instead of byte arrays. This may avoid serialization costs in most cases - provided that most data will be strings.
This will also be easier to debug and will be platform independent. Of course it has some limitations: e.g. a float can not be represented as-is, but may be stored in something similar to a SQL DECIMAL format.
Any solution will be a trade-off.
EDIT However, I would prefer the byte array solution for your case: one byte-array per row. This should be most cache-friendly for fixed-size rows. But then you should also provide a solution for variable-sized rows. A low-level language seems to fit that task better, in C one could define two formats: fixed size rows where the table metadata contains column-offsets (e.g. column 1: bytes 0..31, column 2: bytes 32..127 etc.), and a second variable size row format, where the rows itself contain the columns sizes (e.g. bytes 1..3 contain the size, the following number of bytes contain the data, then another 4 bytes contain the size, following data and so on).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With