
Database design for high sample rate data, graphing at multiple zoom levels

I've got multiple sensors feeding data to my web app. Each channel produces 5 samples per second, and the data is uploaded bundled into 1-minute JSON messages (containing 300 samples). The data will be graphed using flot at multiple zoom levels, from 1 day down to 1 minute.

I'm using Amazon SimpleDB, and I'm currently storing the data in the 1-minute chunks that I receive it in. This works well for high zoom levels, but for a full day there will simply be too many rows to retrieve.

The idea I've currently got is that every hour I can crawl through the data, collapse the last hour into 300 samples, and store them in another table, essentially down-sampling the data.
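For what it's worth, here is a minimal sketch of what that hourly roll-up could look like, assuming the hour's raw samples have already been fetched into a list; the bucket count of 300 and the use of a plain mean are illustrative choices, not anything prescribed above:

    def downsample(samples, buckets=300):
        # Collapse one hour of raw samples (5 Hz * 3600 s = 18000 values)
        # into `buckets` averaged values, one per equal-sized slice.
        # Assumes a complete hour of samples, so no slice is empty.
        size = len(samples) / buckets
        out = []
        for i in range(buckets):
            chunk = samples[int(i * size):int((i + 1) * size)]
            out.append(sum(chunk) / len(chunk))
        return out

Each hourly row would then hold 300 values, mirroring the 300-sample format of the raw 1-minute messages.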

Does this sound like a reasonable solution? How have others implemented the same sort of systems?

asked Feb 26 '23 by Tim


2 Answers

Storing downsampled data is a perfectly fine approach. Check out how munin stores its graphs: daily, monthly, yearly and intraday graphs are stored separately there.

You may store data for each minute, each 5 minutes, each hour, each 4 hours and each day in different tables. The storage overhead is small compared to storing only the per-minute data, and the benefit is large, since you never transmit data you don't need.
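As a rough illustration of the read side, here is a sketch of picking the coarsest table that still gives enough points for a graph; the table names and thresholds are made up, and it assumes one table (or SimpleDB domain) per resolution:

    # Hypothetical per-resolution tables: (seconds per point, table name)
    RESOLUTIONS = [
        (60,    "samples_1min"),
        (300,   "samples_5min"),
        (3600,  "samples_1hour"),
        (14400, "samples_4hour"),
        (86400, "samples_1day"),
    ]

    def pick_table(span_seconds, max_points=500):
        # Walk from finest to coarsest and return the first resolution
        # whose point count for this time span fits under max_points.
        for step, table in RESOLUTIONS:
            if span_seconds / step <= max_points:
                return table
        return RESOLUTIONS[-1][1]   # very long spans fall back to the daily table

For example, a 1-day graph (86400 seconds) would come out of the 5-minute table (288 points) rather than the per-minute one.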

answered Apr 27 '23 by BarsMonster


Speed up the database: use a direct organization model. It's the fastest method for storing and retrieving data from files, and the implementation is so simple that you don't need any framework or library.

The method is:

  1. you have to create an algorithm which converts the key into a continuous record number (0 .. max number of records),
  2. you have to use a fixed record size,
  3. the data is stored in flat files, where a record's position within the file is its record number (based on the key, as described in 1.) multiplied by the record size (see 2.); a small sketch of this addressing follows the list.
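Assuming each record is a single 8-byte double (the format string and helper names below are just for illustration), the addressing is nothing more than:

    import struct

    RECORD_FMT = "<d"                          # assumed layout: one little-endian double per sample
    RECORD_SIZE = struct.calcsize(RECORD_FMT)  # fixed record size (point 2)

    def record_offset(record_no):
        # Byte position of a record in its flat file (point 3):
        # record number * fixed record size.
        return record_no * RECORD_SIZE

    def pack_record(value):
        return struct.pack(RECORD_FMT, value)

    def unpack_record(raw):
        return struct.unpack(RECORD_FMT, raw)[0]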

Native data

You may create one data file per day for easier maintenance. Then your key is the number of the sample within the day, so your daily file will be 18000 * 24 * record size bytes (5 samples per second gives 18000 records per hour). You should pre-create that file filled with 0s in order to make the operating system's life easier (it may not help much; it depends on the underlying filesystem and caching mechanism).

So, when a sample arrives, calculate its file position and write the record into its place.
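Continuing the sketch above, the write path could look like this; the daily file naming, the 8-byte double record and the way the sub-second index is derived are all assumptions made for illustration:

    import os
    import struct
    from datetime import datetime

    RECORD_FMT = "<d"                          # assumed: one 8-byte double per sample
    RECORD_SIZE = struct.calcsize(RECORD_FMT)
    SAMPLES_PER_DAY = 5 * 3600 * 24            # 5 Hz -> 432000 records in a daily file

    def daily_path(ts):
        return ts.strftime("samples-%Y%m%d.dat")    # one flat file per day (invented naming)

    def preallocate(path):
        # Pre-create the day's file filled with zeros so every slot already exists.
        if not os.path.exists(path):
            with open(path, "wb") as f:
                f.write(b"\x00" * (SAMPLES_PER_DAY * RECORD_SIZE))

    def write_sample(ts, value):
        # Record number = seconds into the day * 5 + which of the 5 samples in that second.
        seconds = ts.hour * 3600 + ts.minute * 60 + ts.second
        record_no = seconds * 5 + int(ts.microsecond // 200000)
        path = daily_path(ts)
        preallocate(path)
        with open(path, "r+b") as f:
            f.seek(record_no * RECORD_SIZE)
            f.write(struct.pack(RECORD_FMT, value))

    # e.g. write_sample(datetime.utcnow(), reading)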

Summarized data

You should store the summarized data in direct files, too. These files will be much smaller ones: in the case of 1-minute summarized values there are only 24 * 60 = 1440 records per day.

There are some decisions which you have to take:

  • the stepping of the zoom,
  • the stepping of the summarized data (it may not be worth collecting summarized data for every zoom step),
  • the organization of the summarized databases (the native data may be stored in daily files, but the daily summary values should be stored in monthly files).

Another thing to think about is the creation time of the summarized data. While native data should be stored just as it arrives, summarized data may be calculated at any time:

  • as the native data arrives (in this case a 1-minute summary record would be updated 300 times, so it's not optimal to write it to disk immediately; the summing should be done in memory, as sketched below);
  • a background job can process the native data periodically;
  • the summary data can be created the lazy way, on demand.
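Here is a minimal sketch of the first option, accumulating a minute's worth of samples in memory and flushing a single summary record when the minute rolls over; the min/max/mean record layout and the class name are assumptions, not part of the answer above:

    import struct

    SUMMARY_FMT = "<ddd"                     # assumed summary record: min, max, mean
    SUMMARY_SIZE = struct.calcsize(SUMMARY_FMT)

    class MinuteSummarizer:
        # Accumulate raw samples in memory; write one record per finished minute.

        def __init__(self, path):
            self.path = path                 # flat summary file, 1440 slots per day, pre-created with zeros
            self.minute = None
            self.values = []

        def add(self, minute_of_day, value):
            if self.minute is not None and minute_of_day != self.minute:
                self.flush()
            self.minute = minute_of_day
            self.values.append(value)

        def flush(self):
            if not self.values:
                return
            record = struct.pack(SUMMARY_FMT,
                                 min(self.values),
                                 max(self.values),
                                 sum(self.values) / len(self.values))
            with open(self.path, "r+b") as f:
                f.seek(self.minute * SUMMARY_SIZE)
                f.write(record)
            self.values = []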

Don't forget, not too many years ago these were standard database design issues. I can promise one thing: it will be fast, faster than anything else (except keeping the data in memory).

answered Apr 27 '23 by ern0