Designing timeseries database in Cassandra

Question

I am looking at creating a Cassandra timeseries database for storing millions of series of daily data that can potentially have altogether up to 100B data points.

I looked at this article: http://rubyscale.com/blog/2011/03/06/basic-time-series-with-cassandra/

This design is very sound. Essentially, I can store the daily timestamps as columns and, if necessary, shard the columns by appending the day to the row key.

Two questions I have:

1. I am looking at storing up to 20,000 timestamped (daily) columns. Is it even necessary to shard rows, e.g. by year, with this number of columns? Is there any advantage or disadvantage to sharding rows to bring the number of columns down to 365 per yearly row?
2. Another idea is, rather than sharding columns by row, to create one column family per year. When accessing data from multiple years, I would then have to query multiple column families instead of one and join the results on the client side. Would this approach speed things up or slow everything down?

le-doude · Accepted Answer

If you are ever going to manage huge quantities of writes, there is one problem with your approach.

Always writing to one key means that all writes for that key go to one node. Basically, you will use one node of your cluster per day, so you might as well run one huge Cassandra instance rather than bother setting up a cluster. If your write frequency gets really high, you might bring down the nodes responsible for that day/key.

My advice is to bucket one day into multiple rows that are written to simultaneously. Bucketing by time alone could be dangerous, as a sudden surge during one bucket could bring everything down.

You could create your bucket (row key) like this:

[ROW_BASE_NAME] + [DAY] + someHashFunction(timestamp) % 10
[ROW_BASE_NAME] + [DAY] + random.nextInt(10)
[ROW_BASE_NAME] + [DAY] + nextbucket <--- that is if you have a secure way to rotate the bucket yourself

There are many ways to do it. You could also use some element of the column being saved to pick the bucket. Whichever you choose, I think it is important to do this in order to leverage the whole Cassandra cluster at all times.
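As a rough illustration of the first two key patterns, here is a minimal Python sketch. The ":" separator, the function names, the example series name and the choice of `zlib.crc32` as the hash function are all assumptions made for this example; any stable hash and any unambiguous key layout would do.

```python
import random
import zlib

NUM_BUCKETS = 10  # assumption: matches the "% 10" in the key patterns


def hashed_bucket_key(row_base_name: str, day: str, timestamp: int) -> str:
    """Pick the bucket from a stable hash of the timestamp."""
    # crc32 gives the same result in every process, so a writer and a
    # reader always agree on which bucket a given timestamp landed in.
    bucket = zlib.crc32(str(timestamp).encode("ascii")) % NUM_BUCKETS
    return f"{row_base_name}:{day}:{bucket}"


def random_bucket_key(row_base_name: str, day: str) -> str:
    """Pick the bucket at random; writes spread evenly on average."""
    bucket = random.randrange(NUM_BUCKETS)
    return f"{row_base_name}:{day}:{bucket}"


def all_bucket_keys(row_base_name: str, day: str) -> list:
    """Every row key that may hold data for the day (needed when reading)."""
    return [f"{row_base_name}:{day}:{b}" for b in range(NUM_BUCKETS)]


# Hypothetical series name and day, purely for illustration.
write_key = hashed_bucket_key("sensor42", "2013-01-01", 1357000000)
read_keys = all_bucket_keys("sensor42", "2013-01-01")
```

With the random variant a reader cannot tell which bucket holds a given column, so it has to fetch every key returned by `all_bucket_keys`. The hashed variant lets a point lookup go straight to one row.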

My answer is only valid for write-heavy applications/functionality, since you will have to use a multi_get (whole-row reads across multiple keys) to read all the data and reconstitute the whole timeline for that day.
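The read side could look roughly like the sketch below. It does not call any Cassandra client; the `fetched` dictionary is a hypothetical stand-in for whatever a multi-key read returns, and the (timestamp, value) column layout is an assumption. It relies on the fact that Cassandra returns the columns of a row sorted by the column comparator, so each bucket arrives already ordered and only a merge is needed.

```python
import heapq
from typing import Dict, List, Tuple

Column = Tuple[int, float]  # (timestamp, value); hypothetical column layout


def merge_bucket_rows(rows: Dict[str, List[Column]]) -> List[Column]:
    """Merge per-bucket column lists into one timeline ordered by timestamp.

    Each list must already be sorted by timestamp, which holds when the
    column name is the timestamp and the row is read in comparator order.
    """
    return list(heapq.merge(*rows.values(), key=lambda column: column[0]))


# Stand-in for the result of reading all bucket rows of one day.
fetched = {
    "sensor42:2013-01-01:0": [(1, 10.0), (4, 13.5)],
    "sensor42:2013-01-01:1": [(2, 11.0), (5, 9.0)],
    "sensor42:2013-01-01:2": [(3, 12.5)],
    "sensor42:2013-01-01:3": [],  # an empty bucket is fine
}

timeline = merge_bucket_rows(fetched)
```

The merge is linear in the number of columns, so the extra read cost comes mostly from fetching several rows instead of one, not from the client-side reassembly.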

Tags:

cassandra

time-series

datageek

1 Answers

le-doude