Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Data Structure To Store Arbitrary Database Tables

I'd like to design a JVM data structure (Java/Scala) that can be used to represent and store the contents of arbitrary relational database tables. The data structure should be fast (not too gc-intensive, cache-friendly) and memory efficient, so larger tables can fit in RAM.

One memory-efficient solution is to store each column separately in a primitive array, but I'm worried about the cache friendliness because items in the same row are not stored together. A row with N columns will incur N cache misses, no matter how narrow the columns.

Another solution is to store each row in an object array where each element represents a field and is cast to the correct type on retrieval, but this requires storing numeric types in their boxed form, so it's not very memory-efficient. And it's probably not that cache efficient either.

Another solution is to layout each row's data into a byte array the same way real databases serialize their rows, using only as many bytes as necessary. This is cache-friendly and memory efficient, but I'm concerned about the cost of serialization/de-serialization on every access.

What's the best way?

like image 835
Seun Osewa Avatar asked Aug 06 '10 17:08

Seun Osewa


1 Answers

A fourth solution would be to store each row's data as strings instead of byte arrays. This may avoid serialization costs in most cases - provided that most data will be strings.

This will also be easier to debug and will be platform independent. Of course it has some limitations: e.g. a float can not be represented as-is, but may be stored in something similar to a SQL DECIMAL format.

Any solution will be a trade-off.

EDIT However, I would prefer the byte array solution for your case: one byte-array per row. This should be most cache-friendly for fixed-size rows. But then you should also provide a solution for variable-sized rows. A low-level language seems to fit that task better, in C one could define two formats: fixed size rows where the table metadata contains column-offsets (e.g. column 1: bytes 0..31, column 2: bytes 32..127 etc.), and a second variable size row format, where the rows itself contain the columns sizes (e.g. bytes 1..3 contain the size, the following number of bytes contain the data, then another 4 bytes contain the size, following data and so on).

like image 171
Frunsi Avatar answered Oct 13 '22 11:10

Frunsi