We're building a measurement system that will eventually consist of thousands of measurement stations. Each station will save around 500 million measurements over its lifetime, each consisting of 30 scalar float values. We're now wondering how to store this data on each station, considering that we'll also be building a web app on each station to display its data.
Additionally, we'd like one more server that can show the data of, say, 1000 measurement stations. That would be roughly 50 TB of data across 500 billion measurements. To transmit the data from each measurement station to the server, I thought that some type of database-level replication would be a clean and efficient way to do it.
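For scale, here's a rough back-of-envelope calculation (a sketch assuming 4-byte floats and ignoring timestamps, indexes, and any storage overhead):

```python
# Back-of-envelope data volume, assuming 4-byte single-precision floats and no overhead.
stations = 1000
measurements_per_station = 500_000_000
values_per_measurement = 30
bytes_per_value = 4  # assumption: single-precision floats

per_station_bytes = measurements_per_station * values_per_measurement * bytes_per_value
total_bytes = stations * per_station_bytes

print(f"{per_station_bytes / 1e9:.0f} GB per station")      # ~60 GB
print(f"{total_bytes / 1e12:.0f} TB across 1000 stations")  # ~60 TB raw payload
```

So the raw float payload alone is on the order of the ~50 TB mentioned above, before any per-row overhead.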
Now I'm wondering whether a NoSQL solution might be better than MySQL for these purposes. CouchDB, Cassandra, and perhaps key-value stores like Redis look especially appealing to me. Which of these would best suit the "measurement time series" data model, in your opinion? And what about other concerns such as crash-safety and replication from the measurement stations to the main server?
NoSQL databases use a variety of data models for accessing and managing data. They are optimized for applications that need large data volumes, low latency, and flexible data models, which they achieve by relaxing some of the consistency guarantees of relational databases. NoSQL databases come in several types, including document databases, key-value stores, wide-column stores, and graph databases. MongoDB, a cross-platform document-oriented database offering high performance, high availability, and easy scalability, is generally considered the most popular of them.
I think CouchDB is a great database, but its ability to deal with large data sets is questionable. CouchDB's primary focus is on simplicity of development and offline replication, not necessarily on performance or scalability. CouchDB itself does not support partitioning, so you'll be limited by the maximum size of a single node unless you use BigCouch or invent your own partitioning scheme.
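That said, CouchDB's replication is exactly the station-to-server push you describe. As a minimal sketch (host names, database names, and credentials here are placeholders, not anything from your setup), a station could trigger continuous replication to the central server through CouchDB's standard /_replicate endpoint:

```python
# Hypothetical sketch: push a station's local "measurements" database to a
# central CouchDB server using the built-in /_replicate endpoint.
import requests

resp = requests.post(
    "http://localhost:5984/_replicate",           # the station's own CouchDB
    json={
        "source": "measurements",                  # local database on the station
        "target": "http://central-server:5984/measurements_0042",
        "continuous": True,                        # keep pushing new documents as they arrive
    },
    auth=("admin", "secret"),                      # placeholder credentials
)
resp.raise_for_status()
print(resp.json())
```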
Make no mistake, Redis is an in-memory database. It's extremely fast and efficient at getting data in and out of RAM. It does have the ability to use disk for storage, but it's not terribly good at it. It's great for bounded quantities of data that change frequently. Redis does have replication, but it has no built-in support for partitioning, so again, you'll be on your own here.
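For completeness, here's what that "bounded, frequently changing" case might look like with redis-py; the key naming and the JSON encoding are just illustrative assumptions:

```python
# Hypothetical sketch: keep only the most recent N readings per station in a
# capped Redis list, so memory use stays bounded.
import json

import redis

r = redis.Redis(host="localhost", port=6379)

def push_reading(station_id, values, keep=10_000):
    key = f"station:{station_id}:recent"
    r.lpush(key, json.dumps(values))   # newest reading at the head of the list
    r.ltrim(key, 0, keep - 1)          # drop everything beyond the last `keep` readings

push_reading("station-0042", [0.0] * 30)
latest = json.loads(r.lindex("station:station-0042:recent", 0))
print(latest)
```

That works nicely as a "latest readings" cache, but not as the home for 500 million measurements per station.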
You also mentioned Cassandra, which I think is more on target for your use case. Cassandra is well suited for databases that grow indefinitely; that is essentially its original use case. Partitioning and availability are baked in, so you won't have to worry about them very much. The data model is also a bit more flexible than the average key/value store's, adding a second dimension of columns, and a single row can practically accommodate millions of columns. This allows time-series data to be "bucketed" into rows that each cover a time range, for example. The distribution of data across the cluster (partitioning) is done at the row level, so only one node needs to be involved to perform operations within a row.
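To make the bucketing concrete, here's a minimal sketch using the DataStax Python driver and CQL, the modern equivalent of the wide-row model described above. The keyspace, table, and column names, and the per-day bucketing, are assumptions for illustration:

```python
# Hypothetical sketch: time-series measurements bucketed into one partition
# (row) per station per day.
from datetime import datetime, timezone

from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS metrics
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")

# The composite partition key (station_id, day) bounds the width of each
# partition while keeping a whole day's readings on a single node.
session.execute("""
    CREATE TABLE IF NOT EXISTS metrics.measurements (
        station_id text,
        day        text,
        ts         timestamp,
        readings   list<float>,   -- the 30 scalar values of one measurement
        PRIMARY KEY ((station_id, day), ts)
    )
""")

now = datetime.now(timezone.utc)
session.execute(
    "INSERT INTO metrics.measurements (station_id, day, ts, readings) "
    "VALUES (%s, %s, %s, %s)",
    ("station-0042", now.strftime("%Y-%m-%d"), now, [0.0] * 30),
)
```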
Hadoop plugs right into Cassandra, with "native drivers" for MapReduce, Pig, and Hive, so it could potentially be used to aggregate the collected data and materialize the running averages. The best practice is to shape the data around your queries, so you'll probably want to store multiple copies of the data in "denormalized" form, one for each type of query.
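As an illustration of the "one table per query" idea (again, the names and the exact aggregation are assumptions, and this reuses the `metrics` keyspace from the sketch above): a dashboard that mostly asks for per-day averages could read from a second, pre-aggregated table instead of scanning raw measurements.

```python
# Hypothetical sketch: a denormalized, query-shaped table of daily averages,
# filled periodically by a batch job (Hadoop/Pig/Hive or a plain script).
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect()

session.execute("""
    CREATE TABLE IF NOT EXISTS metrics.daily_averages (
        station_id text,
        channel    int,     -- index 0..29 into the 30 scalar values
        day        text,
        avg_value  float,
        PRIMARY KEY ((station_id, channel), day)
    )
""")

# The batch job writes one row per station, channel, and day...
session.execute(
    "INSERT INTO metrics.daily_averages (station_id, channel, day, avg_value) "
    "VALUES (%s, %s, %s, %s)",
    ("station-0042", 0, "2012-01-15", 3.14),
)

# ...and the dashboard's query then touches exactly one partition.
rows = session.execute(
    "SELECT day, avg_value FROM metrics.daily_averages "
    "WHERE station_id = %s AND channel = %s",
    ("station-0042", 0),
)
for row in rows:
    print(row.day, row.avg_value)
```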
Check out this post on doing time-series in Cassandra:
http://rubyscale.com/2011/basic-time-series-with-cassandra/