
What's an efficient way to store a time-series?

I have a pretty large amount of data in a horrible format: 16.4 GiB of zip files containing CSV files. Each CSV contains lines like

TYPE,2014-07-02 04:04:23.806,0.94598,0.94607

Basically:

  • A TYPE (there are 14 different types, and more could be added in the future)
  • A timestamp (with millisecond precision)
  • 2 float numbers (at most 5 decimal places) which usually have a small spread
  • All the float numbers of the same TYPE usually have very similar values when they're close in time
  • The delta between consecutive timestamps is under 1s
  • The data spans 5 years

In total I have 3,091,472,167 rows, so we're talking billions. The main operations will be looping over part or all of the data set (in order to run algorithms on it) and one insertion of ~20 million records once per month (but inserting data is not something I'm interested in optimizing for).

The ideal solution would have been to store them in a database so that I could query them easily, but some rough calculations showed that I would need 46 GiB with Postgres (which MAYBE could be brought down to 18 GiB with some ugly tricks).

I've done some experiments with the data I have and found that:

  • Generating gzipped files containing TIMESTAMP,FLOAT1,FLOAT2 would get me to ~14 GiB
  • Removing the timestamps entirely would get me down to ~5 GiB
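For reference, a minimal sketch of that kind of repacking (the zip layout, paths and per-type output files are illustrative assumptions, not the exact script I used):

    import csv
    import gzip
    import zipfile

    # Sketch of the repacking experiment: drop the TYPE column and re-emit
    # TIMESTAMP,FLOAT1,FLOAT2 rows into one gzip file per type.
    def repack(zip_path, out_dir):
        writers = {}   # TYPE -> open gzip text handle
        with zipfile.ZipFile(zip_path) as zf:
            for name in zf.namelist():
                with zf.open(name) as raw:
                    lines = (line.decode('ascii') for line in raw)
                    for row_type, ts, f1, f2 in csv.reader(lines):
                        w = writers.get(row_type)
                        if w is None:
                            w = gzip.open(f'{out_dir}/{row_type}.csv.gz', 'wt')
                            writers[row_type] = w
                        w.write(f'{ts},{f1},{f2}\n')
        for w in writers.values():
            w.close()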

A good way to save space could be storing the timestamps more efficiently. I was thinking of creating a tree structure where every node contains one digit of the timestamp and where the leaves contain the final digit and the two floats.
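As a minimal sketch of that idea (purely illustrative, and assuming one trie per TYPE so the leaves stay unambiguous):

    # Minimal sketch of the digit-trie idea: nodes are dicts keyed by one
    # timestamp digit, so rows that are close in time share most of their
    # path; only the leaves carry the two floats.
    def trie_insert(trie, ts_ms, f1, f2):
        digits = str(ts_ms)            # epoch milliseconds as a digit string
        node = trie
        for d in digits[:-1]:          # walk/create one level per digit
            node = node.setdefault(d, {})
        node[digits[-1]] = (f1, f2)    # leaf: final digit -> the two floats

    trie = {}
    trie_insert(trie, 1404187463806, 0.94598, 0.94607)  # 2014-07-02 04:04:23.806 UTC
    trie_insert(trie, 1404187464102, 0.94599, 0.94605)  # shares the first 9 digits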

I've tried looking into solutions like druid.io but it looks like they're trying to optimize for other things my use case doesn't need.

Is there some tool I'm missing which does exactly what I want? If not, what's an efficient way to store time series data?

Thank you

asked by framp

2 Answers

On the PostgreSQL side, you could also use an open source columnar store such as cstore_fdw. This gets you compression, columnar layout, and skip indexes for free. Storage and disk I/O related benefits are:

  1. Since column values are stored sequentially, you get better compression ratios
  2. The database only reads the columns you are querying, and skips the rest
  3. cstore_fdw builds lightweight skip indexes. For time-series data, these indexes automatically filter unrelated data.

For more info: https://news.ycombinator.com/item?id=7523950

If you further need to scale out your queries to multiple CPU cores/machines, you could then use a scalable PostgreSQL: http://www.citusdata.com/downloads

Apart from compression and an efficient data layout, you get PostgreSQL's type checking, data manipulation functionality, and all its querying capabilities for free.
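For illustration, a hedged sketch of what that setup might look like (assuming cstore_fdw is installed and preloaded; the table and column names are made up, compression and stripe_row_count are cstore_fdw's documented foreign-table options, and psycopg2 is only used here to run the DDL):

    import psycopg2

    # Hedged sketch of a cstore_fdw foreign table for this data set.
    # Table/column names and the connection string are illustrative.
    DDL = """
    CREATE EXTENSION IF NOT EXISTS cstore_fdw;
    CREATE SERVER cstore_server FOREIGN DATA WRAPPER cstore_fdw;
    CREATE FOREIGN TABLE ticks (
        type text,
        ts   timestamp,
        v1   real,
        v2   real
    ) SERVER cstore_server
    OPTIONS (compression 'pglz', stripe_row_count '150000');
    """

    with psycopg2.connect("dbname=timeseries") as conn:
        with conn.cursor() as cur:
            cur.execute(DDL)

Bulk loads then go through a regular COPY into the foreign table, and a query that touches only ts, v1 and v2 never reads the type column from disk.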

answered by ozgun

I would:

  1. Break up the data into several thousand segments, grouped by time. I don't know the spread of times, but perhaps a file for each hour.

  2. Store them in subdirectories by timestamp. E.g. 2014/07/02/04.

  3. Put the starting time stamp in the file name. E.g. 2014-07-02 04:04:23.806.gz.

  4. For all lines, store the type as an index in the first byte. This permits expansion to 256 types. If that's not enough, make it two bytes.

  5. For all lines, the time stamp is a two-byte integer, and the floats are signed n-byte integers, sufficient to hold the values times 10^5. I am assuming that these float values are limited in range. Four bytes for each may be enough.

  6. For the first line, the time stamp is zero, representing the difference between the time of that line and the file name.

  7. For the first line, the float values are the actual values.

  8. For all subsequent lines, the time and float values are the difference from the previous line. Note that they can be negative.

  9. Interleave the most-significant bytes of the values to maximize the runs of zeros for small deltas.

  10. Compress each file with zlib. In addition to the default strategy, try the Z_FILTERED and Z_HUFFMAN_ONLY strategies to see if the compression is better or worse.
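A minimal sketch of steps 4-10 in Python (one possible byte layout, assuming the scaled values fit in signed four-byte integers and the per-line time deltas fit in a signed two-byte millisecond count):

    import struct
    import zlib

    def encode_segment(rows, base_ms):
        """Delta-encode one segment (steps 4-9).

        rows:    iterable of (type_index, timestamp_ms, float1, float2), time-sorted
        base_ms: the epoch-millisecond timestamp encoded in the file name
        """
        out = bytearray()
        prev_ms, prev1, prev2 = base_ms, 0, 0
        first = True
        for type_idx, ts_ms, f1, f2 in rows:
            v1, v2 = round(f1 * 10**5), round(f2 * 10**5)   # scale floats to integers
            if first:
                dt, d1, d2 = ts_ms - base_ms, v1, v2        # steps 6-7: absolute values
                first = False
            else:
                dt, d1, d2 = ts_ms - prev_ms, v1 - prev1, v2 - prev2  # step 8: deltas
            prev_ms, prev1, prev2 = ts_ms, v1, v2
            out.append(type_idx)                            # step 4: one-byte type index
            t = struct.pack('>h', dt)                       # step 5: two-byte time value
            a = struct.pack('>i', d1)                       # four-byte scaled floats
            b = struct.pack('>i', d2)
            # step 9: most-significant bytes first and interleaved, so small
            # deltas produce long runs of zero bytes
            out += bytes((a[0], b[0], a[1], b[1], t[0], a[2], b[2], t[1], a[3], b[3]))
        return bytes(out)

    def compress(payload, strategy=zlib.Z_DEFAULT_STRATEGY):
        """Step 10: compress with zlib; also try Z_FILTERED and Z_HUFFMAN_ONLY."""
        c = zlib.compressobj(9, zlib.DEFLATED, 15, 8, strategy)
        return c.compress(payload) + c.flush()

A matching decoder reverses the interleaving and accumulates the deltas.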

answered by Mark Adler