I am looking at creating a Cassandra timeseries database for storing millions of series of daily data that can potentially have altogether up to 100B data points.
I looked at this article: http://rubyscale.com/blog/2011/03/06/basic-time-series-with-cassandra/
The design looks sound. Essentially, I can store the daily timestamps as columns and, if necessary, shard the columns by appending the day to the row key.
Two questions I have:
If you are ever going to handle a huge volume of writes, there is one problem with your approach.
Always writing to a single key means that all writes for that key go to one node (plus its replicas). Basically, you will use one node of your cluster per day, so you might as well run one huge Cassandra instance rather than bother setting up a cluster. If your write frequency gets really high, you might bring down the nodes responsible for that day/key.
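To illustrate the hot spot, here is a toy simulation. It is only a sketch: `node_for` is a hypothetical stand-in for a partitioner (an MD5 hash of the row key, modulo a made-up cluster size), it ignores replication, and the `YYYYMMDD_bucket` key format is an assumption for the example.

```python
import hashlib
from collections import Counter

NUM_NODES = 4  # hypothetical cluster size


def node_for(row_key: str) -> int:
    """Toy stand-in for a partitioner: hash the row key and map it to a node."""
    digest = hashlib.md5(row_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % NUM_NODES


def write_distribution(row_keys) -> Counter:
    """Count how many writes each node receives for the given row keys."""
    return Counter(node_for(key) for key in row_keys)


# 1000 writes to a single row key for the day: a single node takes them all.
single = write_distribution(["20110306"] * 1000)

# 1000 writes spread over 16 bucketed row keys for the same day.
bucketed = write_distribution(f"20110306_{i % 16}" for i in range(1000))

print("single key:", dict(single))
print("bucketed:  ", dict(bucketed))
```

With a single key the counter always has exactly one entry; with bucketed keys the writes are shared among the nodes that the bucket keys hash to.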
My advice is to bucket one day into multiple rows that are used simultaneously. Bucketing by time alone could be dangerous, as a sudden surge within one bucket could bring everything down.
You could create your bucket (row key) like this:
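A minimal sketch of one such row key, assuming a fixed number of buckets per day and a random pick on each write. `NUM_BUCKETS`, `bucket_row_key` and `all_row_keys` are hypothetical names, and the `YYYYMMDD_bucket` format is just one possible layout:

```python
import random
from datetime import date

NUM_BUCKETS = 10  # hypothetical; tune to cluster size and write rate


def bucket_row_key(day: date, num_buckets: int = NUM_BUCKETS) -> str:
    """Build a row key for a write: the day plus a randomly chosen bucket."""
    bucket = random.randrange(num_buckets)
    return f"{day:%Y%m%d}_{bucket}"


def all_row_keys(day: date, num_buckets: int = NUM_BUCKETS) -> list:
    """List every row key that may hold data for the given day."""
    return [f"{day:%Y%m%d}_{b}" for b in range(num_buckets)]


# Each write picks one of the day's buckets at random.
key = bucket_row_key(date(2011, 3, 6))
print(key)
print(all_row_keys(date(2011, 3, 6), num_buckets=3))
```

Because the bucket is random, a reader cannot know which row holds a given column, so it has to fetch every key returned by `all_row_keys` for that day.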
There are many ways to do it. You could also derive the bucket from some element of the column being saved. Whichever way you choose, I think it is important to do this so that you leverage the whole Cassandra cluster at all times.
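As a sketch of the second idea, the bucket can be computed from an element of the data instead of being picked at random. This assumes the element is a string such as a series identifier; `bucket_from_element` and the key format are hypothetical, and CRC32 is used only because it is a stable hash available in the standard library:

```python
import zlib
from datetime import date


def bucket_from_element(day: date, element: str, num_buckets: int = 10) -> str:
    """Build a row key whose bucket is derived from an element of the data."""
    # CRC32 is stable across runs, unlike Python's built-in hash() for strings.
    bucket = zlib.crc32(element.encode("utf-8")) % num_buckets
    return f"{day:%Y%m%d}_{bucket}"


# The same element always maps to the same row for a given day.
print(bucket_from_element(date(2011, 3, 6), "series-42"))
```

The trade-off is that a deterministic bucket lets a reader go straight to the right row for a known element, but a single very busy element still concentrates its writes on one row.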
My answer only applies to write-heavy applications or functionality, since you will have to use a multi_get (whole-row reads across multiple keys) to read all the data and reconstitute the whole timeline for that day.
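The read side can be sketched as follows. This is not client code: a plain dict stands in for the result of a multi_get, each row is assumed to come back as a list of `(timestamp, value)` columns already sorted by timestamp, and `day_row_keys` and `read_day` are hypothetical helpers.

```python
import heapq
from datetime import date


def day_row_keys(day: date, num_buckets: int) -> list:
    """Every bucketed row key that may hold columns for the given day."""
    return [f"{day:%Y%m%d}_{b}" for b in range(num_buckets)]


def read_day(rows_by_key: dict, day: date, num_buckets: int) -> list:
    """Merge the columns of all bucket rows into one timeline for the day."""
    rows = [rows_by_key.get(key, []) for key in day_row_keys(day, num_buckets)]
    # Each row is already sorted by timestamp, so a k-way merge is enough.
    return list(heapq.merge(*rows, key=lambda column: column[0]))


# Stand-in for the multi_get result: row key -> sorted (timestamp, value) columns.
fetched = {
    "20110306_0": [(1, "a"), (4, "d")],
    "20110306_1": [(2, "b")],
    "20110306_2": [(3, "c")],
}
print(read_day(fetched, date(2011, 3, 6), num_buckets=3))
# prints [(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')]
```

Every read of a day touches all of its buckets, which is why the scheme favors write-heavy workloads.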