
NoSQL for time series/logged instrument reading data that is also versioned

My Data

It's primarily monitoring data, passed in the form of Timestamp: Value, for each monitored value, on each monitored appliance. It's regularly collected over many appliances and many monitored values.

Additionally, it has the quirky feature of many of these data values being derived at the source, with the calculation changing from time to time. This means that my data is effectively versioned, and I need to be able to simply call up only data from the most recent version of the calculation. Note: This is not versioning where the old values are overwritten. I simply have timestamp cutoffs, beyond which the data changes its meaning.
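To make the cutoff idea concrete, here is a minimal Python sketch (Python because that is what the downstream code will be written in). The function names are made up for illustration, and the convention that a reading stamped exactly at a cutoff belongs to the new version is an assumption, not something fixed by the data source.

```python
from bisect import bisect_left, bisect_right
from datetime import datetime


def version_of(ts, cutoffs):
    """Return the calculation version in effect at `ts`.

    `cutoffs` is a sorted list of timestamps at which the calculation
    changed. Version 0 is everything before the first cutoff.
    Assumption: a reading stamped exactly at a cutoff uses the new version.
    """
    return bisect_right(cutoffs, ts)


def latest_version_readings(readings, cutoffs):
    """Return only the readings produced by the most recent calculation.

    `readings` is a list of (timestamp, value) pairs sorted by timestamp.
    """
    if not cutoffs:
        return list(readings)
    timestamps = [ts for ts, _ in readings]
    start = bisect_left(timestamps, cutoffs[-1])  # first reading at/after last cutoff
    return readings[start:]


readings = [
    (datetime(2012, 1, 1), 1.0),
    (datetime(2012, 2, 1), 2.0),
    (datetime(2012, 3, 1), 3.0),
]
cutoffs = [datetime(2012, 1, 15), datetime(2012, 2, 1)]
current = latest_version_readings(readings, cutoffs)
```

Because the data arrives in order, "latest version only" is just a range read starting at the last cutoff, which is the access pattern any ordered store would need to serve.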

My Usage

Downstream, I'm going to have various undefined data mining/machine learning uses for the data. It's not really clear yet what those uses are, but it is clear that I will be writing all of the downstream code in Python. Also, we are a very small shop, so I can really only deal with so much complexity in setup, maintenance, and interfacing to downstream applications. We just don't have that many people.

The Choice

I am not allowed to use a SQL RDBMS to store this data, so I have to find the right NoSQL solution. Here's what I've found so far:

  1. Cassandra
    • Looks totally fine to me, but it seems like some of the major users have moved on. It makes me wonder if it's just not going to be that much of a vibrant ecosystem. This SE post seems to have good things to say: Cassandra time series data
  2. Accumulo
    • Again, this seems fine, but I'm concerned that this is not a major, actively developed platform. It seems like this would leave me a bit starved for tools and documentation.
  3. MongoDB
    • I have a, perhaps irrational, intense dislike for the Mongo crowd, and I'm looking for any reason to discard this as a solution. It seems to me like the data model of Mongo is all wrong for things with such a static, regular structure. My data even comes in (and has to stay in) order. That said, everybody and their mother seems to love this thing, so I'm really trying to evaluate its applicability. See this and many other SE posts: What NoSQL DB to use for sparse Time Series like data?
  4. HBase
    • This is where I'm currently leaning. It seems like the successor to Cassandra with a totally usable approach for my problem. That said, it is a big piece of technology, and I'm concerned about really knowing what it is I'm signing up for, if I choose it.
  5. OpenTSDB
    • This is basically a time-series specific database, built on top of HBase. Perfect, right? I don't know. I'm trying to figure out what another layer of abstraction buys me.

My Criteria

  • Open source
  • Works well with Python
  • Appropriate for a small team
  • Very well documented
  • Has specific features to take advantage of ordered time series data
  • Helps me solve some of my versioned data problems

So, which NoSQL database actually can help me address my needs? It can be anything, from my list or not. I'm just trying to understand what platform actually has code, not just usage patterns, that support my super specific, well understood needs. I'm not asking which one is best or which one is cooler. I'm trying to understand which technology can most natively store and manipulate this type of data.

Any thoughts?

jsmith54 asked Jun 23 '12 02:06


2 Answers

It sounds like you are describing one of the most common use cases for Cassandra. Time series data in general is often a very good fit for the Cassandra data model. More specifically, many people store metric/sensor data like you are describing. See:

  • http://rubyscale.com/blog/2011/03/06/basic-time-series-with-cassandra/
  • http://www.datastax.com/dev/blog/advanced-time-series-with-cassandra
  • http://engineering.rockmelt.com/post/17229017779/modeling-time-series-data-on-top-of-cassandra

As for your concerns about the community, I'm not sure what is giving you that impression, but there is quite a large community (see IRC, mailing lists) as well as a growing number of Cassandra users.

http://www.datastax.com/cassandrausers

Regarding your criteria:

  • Open source
    • Yes
  • Works well with Python
    • http://pycassa.github.com/pycassa/
  • Appropriate for a small team
    • Yes
  • Very well documented
    • http://www.datastax.com/docs/1.1/index
  • Has specific features to take advantage of ordered time series data
    • See above links
  • Helps me solve some of my versioned data problems
    • If I understand your description correctly, you could solve this in multiple ways. You could start writing a new row when the version changes. Alternatively, you could use composite columns to store the version along with the timestamp/value pair.
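The two versioning layouts from the last bullet can be sketched in plain Python. This is only an in-memory model of the idea, not Cassandra or pycassa code: the dicts stand in for column families, the sorted lists stand in for ordered columns within a row, and every name here (including integer version numbers and numeric timestamps) is an assumption for illustration.

```python
from bisect import bisect_left, insort

# Option A: a new row per version.
# Row key = (appliance, metric, version); columns = (timestamp, value), kept sorted.
rows_a = {}


def write_a(appliance, metric, version, ts, value):
    insort(rows_a.setdefault((appliance, metric, version), []), (ts, value))


def read_a(appliance, metric, version):
    # Reading one version is a read of a single row.
    return list(rows_a.get((appliance, metric, version), []))


# Option B: one row per metric with composite column names.
# Row key = (appliance, metric); column name = (version, timestamp), kept sorted.
rows_b = {}


def write_b(appliance, metric, version, ts, value):
    insort(rows_b.setdefault((appliance, metric), []), ((version, ts), value))


def read_b(appliance, metric, version):
    # Columns sort by version first, so one version is a contiguous slice.
    cols = rows_b.get((appliance, metric), [])
    names = [name for name, _ in cols]
    lo = bisect_left(names, (version,))      # first column of this version
    hi = bisect_left(names, (version + 1,))  # first column of the next version
    return [(ts, value) for (_, ts), value in cols[lo:hi]]
```

Either way, fetching "only the latest version" needs the current version number and then a single ordered read. Option A keeps rows smaller; option B keeps a metric's whole history in one row.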

I'll also note that Accumulo, HBase, and Cassandra all have essentially the same data model. You will still find small differences around the data model in regards to specific features that each database offers, but the basics will be the same.

The bigger difference between the three will be the architecture of the system. Cassandra takes its architecture from Amazon's Dynamo. Every server in the cluster is the same, and it is quite simple to set up. HBase and Accumulo are more direct clones of BigTable. These have more moving parts and will require more setup and more types of servers: for example, setting up HDFS, ZooKeeper, and HBase/Accumulo-specific server types.

Disclaimer: I work for DataStax (we work with Cassandra)

nickmbailey answered Nov 11 '22 13:11


I only have experience in Cassandra and MongoDB but my experience might add something.

So you're basically doing time-based metrics?

OK, if I understand correctly, you use the timestamp as a versioning mechanism and query by timestamp: to get the latest calculation used, you look up the metric ID (or whatever identifies it), sort by ts DESC, and take the first row?

It sounds like a versioned key value store at times.

With this in mind I probably would not recommend either of the two I have used.

Cassandra is too rigid and too hierarchical, too tightly built around how you query, to the point where you can only make one pivot of graph data (I presume you would want to graph these metrics) from the column family, which is crazy, hence why I dropped it. As for searching (which Facebook uses it for, and only that), it's not that impressive either.

MongoDB, well, I love MongoDB and I am an elite of the user group, and it could work here if you didn't use a key-value storage policy. But at the end of the day, if your mind is not set and you don't like the tech, then let me be the very first to say: don't use it! You will be no good at a tech that you don't like, so stay away from it.

Though I would picture this happening in Mongo much like:

{
    _id: ObjectId(),
    metricId: 'AvailableMessagesInQueue',
    formula: '4+5/10.01',
    result: NaN,
    ts: ISODate()
}

And you query for the latest version of your calculation by:

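// Sort newest first and ask for a single document.
var results = db.metrics.find({ 'metricId': 'AvailableMessagesInQueue' }).sort({ ts: -1 }).limit(1);
// In the mongo shell, a cursor is read with hasNext()/next().
var latest = results.hasNext() ? results.next() : null;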

Which would output the doc structure you see above. Without knowing more about exactly how you wish to query, and the general server and app scenario, that's the best I can come up with.
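Since the downstream code is going to be Python, the same "newest document for a metric" lookup might look roughly like the sketch below. `latest_metric` is a made-up helper name; it relies on pymongo's `Collection.find_one` accepting a `sort` argument. `InMemoryCollection` is a hypothetical stand-in (not part of pymongo) that only exists so the sketch runs without a server, and the field names follow the document above.

```python
def latest_metric(collection, metric_id):
    """Return the newest document for `metric_id`, or None if there is none.

    `collection` is expected to behave like a pymongo Collection.
    """
    # Sort by ts descending and take the first match.
    return collection.find_one({"metricId": metric_id}, sort=[("ts", -1)])


class InMemoryCollection:
    """Hypothetical stand-in for a pymongo Collection, for trying this offline.

    Supports only equality filters and a single sort key.
    """

    def __init__(self, docs):
        self.docs = list(docs)

    def find_one(self, query, sort=None):
        matches = [
            doc for doc in self.docs
            if all(doc.get(key) == want for key, want in query.items())
        ]
        if sort:
            key, direction = sort[0]
            matches.sort(key=lambda doc: doc[key], reverse=direction < 0)
        return matches[0] if matches else None


metrics = InMemoryCollection([
    {"metricId": "AvailableMessagesInQueue", "formula": "v1", "ts": 1},
    {"metricId": "AvailableMessagesInQueue", "formula": "v2", "ts": 3},
    {"metricId": "SomethingElse", "formula": "v1", "ts": 5},
])
latest = latest_metric(metrics, "AvailableMessagesInQueue")
```

Against a real deployment you would pass an actual collection object instead of the stand-in, and an index on metricId and ts would probably be wanted so the sort does not scan every document.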

I found this thread on HBase though: http://mail-archives.apache.org/mod_mbox/hbase-user/201011.mbox/%3C5A76F6CE309AD049AAF9A039A39242820F0C20E5@sc-mbx04.TheFacebook.com%3E

It might be of interest; it seems to support the argument that HBase is a good time-based key-value store.

I have not personally used HBase, so do not take anything I say about it seriously.

I hope I have added something; if not, you could try narrowing your criteria so we can answer more specific questions.

Hope it helps a little,

Sammaye answered Nov 11 '22 14:11