
storing telemetry data from 10000s of nodes

I need to store telemetry data that is being generated every few minutes from over 10000 nodes (which may increase), each supplying the data over the internet to a server for logging. I'll also need to query this data from a web application.

I'm having a bit of trouble deciding what the best storage solution would be.

Each node has a unique ID, and each packet of variables will have a timestamp (which will probably need to be generated by the server).

The telemetry data has all of the variables in the same packet, so conceptually it could easily be stored in a single database table with a column per variable. The device ID + timestamp would suffice as a key. Each telemetry packet is 64 bytes, including the device ID and timestamp, so around 100 GB+ per year.
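As a rough sanity check on that figure (the 2-minute reporting interval here is an assumption; the question only says "every few minutes"):

```python
# Back-of-envelope storage estimate.
# Assumed: 10,000 nodes, one 64-byte packet per node every 2 minutes.
nodes = 10_000
packet_bytes = 64
seconds_per_year = 365 * 24 * 3600
packets_per_node = seconds_per_year // 120  # one packet every 2 minutes

total_bytes = nodes * packet_bytes * packets_per_node
print(f"{total_bytes / 1e9:.0f} GB/year")  # 168 GB/year, before indexes
```

A longer interval (say 5 minutes) drops this to roughly 67 GB/year, so "100 GB+" is in the right ballpark either way, and indexes will add on top of that.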

I'd want to be able to query the data to get variables across time ranges and also store aggregate reports of this data so that I can draw graphs.
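To make the single-table idea concrete, here is a minimal sketch using Python's built-in sqlite3 as a stand-in for MySQL (the column names `temp` and `voltage` are placeholders for the real variables, and the MySQL DDL would differ slightly):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE telemetry (
        device_id INTEGER NOT NULL,
        ts        INTEGER NOT NULL,   -- server-generated Unix timestamp
        temp      REAL,
        voltage   REAL,
        PRIMARY KEY (device_id, ts)   -- composite key: device + time
    )
""")
con.executemany(
    "INSERT INTO telemetry VALUES (?, ?, ?, ?)",
    [
        (1, 1000, 21.0, 3.30),
        (1, 1120, 22.0, 3.29),
        (2, 1000, 20.0, 3.31),
    ],
)

# One variable for one device across a time range -- with the composite
# primary key this is a cheap index range scan, no per-device table needed.
series = con.execute(
    "SELECT ts, temp FROM telemetry "
    "WHERE device_id = ? AND ts BETWEEN ? AND ? ORDER BY ts",
    (1, 900, 1200),
).fetchall()
print(series)  # [(1000, 21.0), (1120, 22.0)]

# Aggregate for graphing: average per coarse time bucket (hourly here).
buckets = con.execute(
    "SELECT ts / 3600 AS bucket, AVG(temp) FROM telemetry GROUP BY bucket"
).fetchall()
print(buckets)  # [(0, 21.0)]
```

In practice the aggregates would be precomputed into a separate rollup table on a schedule rather than scanned on every graph request.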

Now, how best to handle this? I'm pretty familiar with MySQL, so I'm leaning towards it. If I were to go with MySQL, would it make sense to have a separate table for each device ID? Would this make queries much faster, or would having tens of thousands of tables be a problem?

I don't think querying the variables from all devices in one go is going to be needed, but it might be. Or should I just put it all in a single table and use MySQL Cluster if it gets really big?

Or is there a better solution? I've been looking at some non-relational databases but can't see anything that fits the bill well or looks very mature. MongoDB, for example, would have quite a lot of per-row size overhead, and I don't know how efficiently it would query the value of a single variable across a large time range compared to MySQL. MySQL has also been around for a while and is robust.

I'd also like it to be easy to replicate the data and back it up.

Any ideas? If anyone has done anything similar, your input would be greatly appreciated!

Pete asked May 09 '12


1 Answer

Have you looked at time-series databases? They're designed for the use case you're describing and may actually end up being more efficient in terms of space requirements due to built-in data folding and compression.

I would recommend looking into implementations using HBase or Cassandra for raw storage as it gives you proven asynchronous replication capabilities and throughput.

HBase time-series databases:

  • OpenTSDB

  • KairosDB

  • Axibase Time-Series Database (my affiliation)

Sergei Rodionov answered Oct 02 '22