
Optimal data structure for time and source dependent log data for fast browsing?

Tags: c++, algorithm

I've got field bus data that gets sent in packets and contains a datum (e.g. a float) from a source.

=> I get timestamps with a source ID and a value.

Now I want to create a little program (actually a logging daemon in C++ that offers a query interface over HTTP for displaying the data in a plot) where the user can select a few of the sources and the time range of interest and then gets the data drawn. This daemon will run on a Linux-based embedded system.

So my question is: what is the most efficient (query performance and memory consumption) data storage scheme for that data?


Addendum #1:

Although I think the algorithm question is interesting on its own, here is some information about the problem that prompted it:

  • The data rate is typically 3 packets / second (bursts of up to 30/s are common)
  • Interesting data might be as old as a month (the more the better; the algorithm might use a hierarchy that allows ultra-fast lookup for the last day, fast lookup for the last week and acceptable lookup for older data)
  • The IDs are (at the moment) 32 bits wide.
  • There are roughly 1000 IDs in use - but it's not known in advance which ones, and the user might start using an additional ID at any time
  • The values stored will have different data types (boolean, integer, float - even 14-byte-wide strings are possible)
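To make the constraints above concrete, here is a minimal sketch of how one logged datum could be modelled in C++17. All names (`SourceId`, `FixedString`, `Value`, `Sample`, `makeFixedString`) are hypothetical, and the timestamp unit (seconds) is an assumption; only the field widths come from the list above.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstdint>
#include <string>
#include <variant>

// Hypothetical names; widths follow the figures given in the question.
using SourceId = std::uint32_t;            // IDs are 32 bits wide at the moment
using FixedString = std::array<char, 14>;  // strings of up to 14 bytes

// One value of any of the data types mentioned in the question.
using Value = std::variant<bool, std::int32_t, float, FixedString>;

struct Sample {
    std::uint32_t timestamp;  // assumption: seconds since some epoch
    SourceId id;
    Value value;
};

// Copies at most 14 bytes of `text`; shorter strings are zero-padded.
inline FixedString makeFixedString(const std::string& text) {
    FixedString out{};  // value-initialised, so every byte starts as zero
    std::copy_n(text.begin(), std::min(text.size(), out.size()), out.begin());
    return out;
}
```

Note that a `std::variant` carries a type tag and is as large as its largest alternative, so a `Sample` like this takes more than the 12 bytes assumed in the estimate below. A packed on-disk format would probably store the type once per source instead.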

Doing a bit of math:

  • Assuming a 32 bit timestamp + 32 bit ID + 32 bit value on average, each datum to store takes 12 bytes
  • For a month that is 12*3*60*60*24*30 = 93,312,000 bytes, i.e. about 100 MB of data to filter through (in real time, on a 500 MHz Geode CPU)
  • Showing the plot for the last day narrows this to 1/30th of the data - that leaves about 3 MB to filter through.
  • That 3 MB will be reduced to 1/1000th (= 3 KB) by showing only the relevant ID.
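The estimate above can be reproduced with a few compile-time constants. This is only a sketch of the arithmetic; the constant and function names are made up for illustration, and the figures assume the typical rate of 3 packets per second rather than the bursts.

```cpp
#include <cassert>
#include <cstdint>

// Figures taken from the question; names are hypothetical.
constexpr std::uint64_t kBytesPerSample   = 4 + 4 + 4;     // timestamp + ID + value
constexpr std::uint64_t kSamplesPerSecond = 3;             // typical rate, not bursts
constexpr std::uint64_t kSecondsPerDay    = 60 * 60 * 24;
constexpr std::uint64_t kDaysPerMonth     = 30;
constexpr std::uint64_t kSourceCount      = 1000;          // roughly 1000 IDs in use

constexpr std::uint64_t bytesPerDay() {
    return kBytesPerSample * kSamplesPerSecond * kSecondsPerDay;
}
constexpr std::uint64_t bytesPerMonth() {
    return bytesPerDay() * kDaysPerMonth;
}
// Average share of one source, assuming all sources send equally often.
constexpr std::uint64_t bytesPerDayPerSource() {
    return bytesPerDay() / kSourceCount;
}

// Sanity check of the numbers quoted in the text.
inline void checkEstimates() {
    assert(bytesPerMonth() == 93312000ULL);      // about 100 MB per month
    assert(bytesPerDay() == 3110400ULL);         // about 3 MB per day
    assert(bytesPerDayPerSource() == 3110ULL);   // about 3 KB per day and ID
}
```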

Addendum #2:

This problem basically asks how to map a 2D dataset (time and ID are the dimensions) into memory (and from there serialize it to a file). The constraint is that both dimensions will be filtered.

The suggested time-sorted array is an obvious solution for handling the time dimension. (To increase query performance, a tree-based index might be used. A plain binary search isn't so easy, as each entry might have a different size - but the index tree covers that nicely and is based on the same underlying idea.)
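One way to get a binary search despite variable-size entries is a separate index of fixed-size entries that point into the log. The sketch below uses a sorted array of offsets instead of a tree, which is a simplification and not the tree index mentioned above; `IndexEntry` and `timeRange` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Fixed-size index entry pointing at a variable-size record in the log.
struct IndexEntry {
    std::uint32_t timestamp;  // time of the record
    std::uint64_t offset;     // byte offset of the record in the log buffer/file
};

// Returns the half-open range [first, last) of positions in `index` whose
// timestamps satisfy from <= timestamp <= to.
// Precondition: `index` is sorted by timestamp (true if entries are appended
// in arrival order and timestamps never go backwards).
inline std::pair<std::size_t, std::size_t>
timeRange(const std::vector<IndexEntry>& index, std::uint32_t from, std::uint32_t to) {
    assert(std::is_sorted(index.begin(), index.end(),
        [](const IndexEntry& a, const IndexEntry& b) { return a.timestamp < b.timestamp; }));
    auto first = std::lower_bound(index.begin(), index.end(), from,
        [](const IndexEntry& e, std::uint32_t t) { return e.timestamp < t; });
    auto last = std::upper_bound(first, index.end(), to,
        [](std::uint32_t t, const IndexEntry& e) { return t < e.timestamp; });
    return std::make_pair(static_cast<std::size_t>(first - index.begin()),
                          static_cast<std::size_t>(last - index.begin()));
}
```

Both lookups are O(log n) in the number of index entries. The records inside the returned range still have to be scanned to pick out the wanted IDs.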

Going that route (filtering first by one dimension (time), then by the other) will, I fear, result in poor performance for the ID filtering, as I would have to use a brute-force lookup.
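For comparison, the opposite order (ID first, then time) avoids the brute-force scan. The following is only a sketch of that alternative, not something proposed in the question: it keeps one time-sorted array per source ID, stores plain `float` values for simplicity, and uses the hypothetical names `Point` and `SeriesStore`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <vector>

struct Point {
    std::uint32_t timestamp;
    float value;  // simplification: the real data also has bool, int and string values
};

class SeriesStore {
public:
    // Appends a sample. Assumption: per source, timestamps never go backwards.
    void append(std::uint32_t id, std::uint32_t timestamp, float value) {
        std::vector<Point>& series = series_[id];  // creates the series for a new ID
        assert(series.empty() || series.back().timestamp <= timestamp);
        series.push_back(Point{timestamp, value});
    }

    // Returns copies of all points of `id` with from <= timestamp <= to.
    std::vector<Point> query(std::uint32_t id, std::uint32_t from, std::uint32_t to) const {
        auto it = series_.find(id);
        if (it == series_.end()) return {};  // unknown ID: nothing to plot
        const std::vector<Point>& s = it->second;
        auto first = std::lower_bound(s.begin(), s.end(), from,
            [](const Point& p, std::uint32_t t) { return p.timestamp < t; });
        auto last = std::upper_bound(first, s.end(), to,
            [](std::uint32_t t, const Point& p) { return t < p.timestamp; });
        return std::vector<Point>(first, last);
    }

private:
    std::unordered_map<std::uint32_t, std::vector<Point>> series_;
};
```

With this layout the ID lookup is a hash lookup and the time filter is a binary search within one series. Since the ID is stored once per series rather than once per sample, each point takes 8 bytes here instead of 12, ignoring container overhead. New IDs need no special handling, which fits the requirement that the set of IDs is not known in advance.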

asked Dec 05 '25 23:12 by Chris


1 Answer

You could just store your data in SQLite and have your web-server run SQL queries against it. Using existing tools you can prototype rapidly and see how well it scales for your purposes.

answered Dec 08 '25 12:12 by Maxim Egorushkin


