I have a pretty large amount of data in a horrible format: 16.4 GiB of zip files containing CSV files.
Each CSV contains lines like:
TYPE,2014-07-02 04:04:23.806,0.94598,0.94607
Basically, each line contains a TYPE (there are 14 different types, and they could ideally grow in the future), a timestamp (with millisecond precision), and two floats; rows of the same TYPE usually have very similar values when they're close in time.
In total I have 3'091'472'167 rows, so we're talking billions. The main operations will be looping over part or all of the data set (in order to run algorithms on the data), plus one insertion of ~20 million records once per month (but inserting data is not something I'm interested in optimizing for).
The ideal solution would have been storing them in a database so that I could query them easily, but some rough calculations showed me that I would need 46GiB with Postgres (which MAYBE could be brought down to 18GiB with some ugly tricks).
I've done some experiments with the data I have, and I found out that:
Generating gzipped files containing TIMESTAMP,FLOAT1,FLOAT2 would get me to ~14 GiB.
Removing the timestamps entirely would get me down to ~5 GiB.
A good way to save space could be to store the timestamps more efficiently. I was thinking of creating a tree structure where every node contains one digit of the timestamp and where the leaves contain the final digit and the two floats.
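To make the idea a bit more concrete, a rough sketch of such a trie (in Python, using nested dicts keyed by the digits of an epoch-millisecond timestamp, ignoring TYPE, and with made-up names) could look like this:

    # Rough sketch only: one dict level per digit of the epoch-ms timestamp;
    # the leaf is keyed by the final digit and holds the two floats.
    # Nearby timestamps share most of their prefix nodes, which is where the
    # space saving would come from.

    def trie_insert(trie, ts_ms, f1, f2):
        digits = str(ts_ms)
        node = trie
        for d in digits[:-1]:                    # interior nodes: one digit each
            node = node.setdefault(d, {})
        node.setdefault(digits[-1], []).append((f1, f2))   # leaf: final digit + floats

    trie = {}
    trie_insert(trie, 1404187463806, 0.94598, 0.94607)  # 2014-07-02 04:04:23.806 UTC
    trie_insert(trie, 1404187464013, 0.94601, 0.94612)  # a nearby (made-up) row reuses the prefix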
I've tried looking into solutions like druid.io, but it looks like they're optimizing for things my use case doesn't need.
Is there some tool I'm missing which does exactly what I want? If not, what's an efficient way to store time series data?
Thank you
On the PostgreSQL side, you could also use an open source columnar store. This gets you compression, a columnar layout, and skip indexes for free, which helps with both storage and disk I/O.
For more info: https://news.ycombinator.com/item?id=7523950
If you further need to scale out your queries to multiple CPU cores/machines, you could then use a scalable PostgreSQL: http://www.citusdata.com/downloads
Apart from compression and an efficient data layout, you get PostgreSQL's type checking, data manipulation functionality, and all its querying capabilities for free.
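As a hedged illustration only (assuming the open source columnar store meant here is the cstore_fdw extension, and with made-up table and column names), setting up a compressed columnar table from Python via psycopg2 could look roughly like this:

    # Sketch: create a cstore_fdw foreign table with pglz compression.
    # Assumes the cstore_fdw extension is installed on the server.
    import psycopg2

    conn = psycopg2.connect("dbname=ticks")
    cur = conn.cursor()
    cur.execute("CREATE EXTENSION IF NOT EXISTS cstore_fdw")
    cur.execute("CREATE SERVER cstore_server FOREIGN DATA WRAPPER cstore_fdw")
    cur.execute("""
        CREATE FOREIGN TABLE samples (
            type   smallint,
            ts     timestamp,
            value1 real,
            value2 real
        ) SERVER cstore_server
        OPTIONS (compression 'pglz')
    """)
    conn.commit()

Once the foreign table exists, the data can be loaded with COPY and queried with ordinary SQL.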
I would:
Break up the data into several thousand segments, grouped by time. I don't know the spread of times, but perhaps a file for each hour.
Store them in subdirectories by timestamp. E.g. 2014/07/02/04.
Put the starting time stamp in the file name, e.g. 2014-07-02 04:04:23.806.gz.
For all lines, store the type as an index in the first byte. This permits expansion to 256 types. If that's not enough, make it two bytes.
For all lines, the time stamp is a two-byte integer, and the floats are signed n-byte integers, sufficient to hold the values times 10^5. I am assuming that these float values are limited in range; four bytes for each may be enough.
For the first line, the time stamp is zero, representing the difference from the time of that line and the file name.
For the first line, the float values are the actual values.
For all subsequent lines, the time and float values are the difference from the previous line. Note that they can be negative.
Interleave the most-significant bytes of the values to maximize the runs of zeros for small deltas.
Compress each file with zlib. In addition to the default mode, try the Z_FILTERED and Z_HUFFMAN_ONLY modes to see if the compression is better or worse. (A sketch of the whole encoding follows below.)
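Pulling those steps together, a rough sketch of the per-file encoding could look like the following. The field widths follow the list above; the function name, the big-endian packing, and the exact way the two float-delta fields are interleaved are my own assumptions, and there is no error handling.

    import struct
    import zlib

    SCALE = 10 ** 5   # the floats carry 5 decimal places, so scale them to integers

    def encode_hour(rows, base_ts_ms):
        # rows: iterable of (type_index, ts_ms, f1, f2) sorted by ts_ms.
        # base_ts_ms is the timestamp encoded in the file name, so the first
        # row's time delta is zero when the file is named after that row, and
        # its "deltas" for the floats are the actual (scaled) values.
        out = bytearray()
        prev_t, prev_v1, prev_v2 = base_ts_ms, 0, 0
        for type_index, ts_ms, f1, f2 in rows:
            v1, v2 = round(f1 * SCALE), round(f2 * SCALE)
            dt, dv1, dv2 = ts_ms - prev_t, v1 - prev_v1, v2 - prev_v2
            prev_t, prev_v1, prev_v2 = ts_ms, v1, v2

            out.append(type_index)            # 1 byte: type index (up to 256 types)
            out += struct.pack(">h", dt)      # 2-byte signed time delta (assumes gaps under ~32 s)
            dv1_b = struct.pack(">i", dv1)    # 4-byte signed deltas of the scaled floats
            dv2_b = struct.pack(">i", dv2)
            # Interleave the two float deltas byte by byte, most significant
            # first, so the leading 0x00 (or 0xFF for negatives) bytes of
            # small deltas cluster into longer runs for the compressor.
            for a, b in zip(dv1_b, dv2_b):
                out.append(a)
                out.append(b)

        # Z_FILTERED is aimed at small numeric deltas; Z_HUFFMAN_ONLY is the
        # other strategy worth comparing, as suggested above.
        comp = zlib.compressobj(9, zlib.DEFLATED, zlib.MAX_WBITS,
                                zlib.DEF_MEM_LEVEL, zlib.Z_FILTERED)
        return comp.compress(bytes(out)) + comp.flush()

For the example row in the question, consecutive values differing by 0.00009 become a delta of just 9, so most of the bytes fed to zlib are zeros.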