Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Using Cassandra to store immutable data?

We're investigating options to store and read a lot of immutable data (events) and I'd like some feedback on whether Cassandra would be a good fit.

Requirements:

  1. We need to store about 10 events per seconds (but the rate will increase). Each event is small, about 1 Kb.
  2. A really important requirement is that we need to be able to replay all events in order. For us it would be fine to read all data in insertion order (like a table scan) so an explicit sort might not be necessary.

Querying the data in any other way is not a prime concern and since Cassandra is a schema db I don't suppose it's possible when the events come in many different forms? Would Cassandra be a good fit for this? If so is there something one should be aware of?

like image 291
Johan Avatar asked Jan 16 '16 19:01

Johan


1 Answers

I've had the exact same requirements for a "project" (rather a tool) a year ago, and I used Cassandra and I didn't regret. In general it fits very well. You can fit quite a lot of data in a Cassandra cluster and the performance is impressive (although you might need tweaking) and the natural ordering is a nice thing to have.

Rather than expressing the benefits of using it, I'll rather concentrate on possible pitfalls you might not consider before starting.

You have to think about your schema. The data is naturally ordered within one row by the clustering key, in your case it will be the timestamp. However, you cannot order data between different rows. They might be ordered after the query, but it is not guaranteed in any way so don't think about it. There was some kind of way to write a query before 2.1 I believe (using order by and disabling paging and allowing filtering) but that introduced bad performance and I don't think it is even possible now. So you should order data between rows on your querying side.

This might be an issue if you have multiple variable types (such as temperature and pressure) that have to be replayed at the same time, and you put them in different rows. You have to get those rows with different variable types, then do your resorting on the querying side. Another way to do it is to put all variable types in one row, but than filtering for only a subset is an issue to solve.

Rowlength is limited to 2 billion elements, and although that seems a lot, it really is not unreachable with time series data. Especially because you don't want to get near those two billions, keep it lower in hundreds of millions maximum. If you put some parameter on which you will split the rows (some increasing index or rounding by day/month/year) you will have to implement that in your query logic as well.

Experiment with your queries first on a dummy example. You cannot arbitrarily use <, > or = in queries. There are specific rules in SQL with filtering, or using the WHERE clause..

All in all these things might seem important, but they are really not too much of a hassle when you get to know Cassandra a bit. I'm underlining them just to give you a heads up. If something is not logical at first just fall back to understanding why it is like that and the whole theory about data distribution and the ring topology.

Don't expect too much from the collections within the columns, their length is limited to ~65000 elements.

Don't fall into the misconception that batched statements are faster (this one is a classic :) )

like image 190
Aleksandar Stojadinovic Avatar answered Sep 29 '22 08:09

Aleksandar Stojadinovic