Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

When NOT to use Cassandra? [closed]

People also ask

When should you not use Cassandra?

Cassandra is based on a NoSQL database and does not provide ACID and relational data properties. If you have a strong requirement for ACID properties (for example Financial data), Cassandra would not be a fit in that case.

Is Cassandra good for frequent updates?

Cassandra is incredible at writes (here are the reasons for this amazing write performance). But it's only append-oriented. If you need to update a lot, Cassandra's no good: for each update, it just adds a 'younger' data version with the same primary key.


There is nothing like a silver bullet, everything is built to solve specific problems and has its own pros and cons. It is up to you, what problem statement you have and what is the best fitting solution for that problem.

I will try to answer your questions one by one in the same order you asked them. Since Cassandra is based on the NoSQL family of databases, it's important you understand why use a NoSQL database before I answer your questions.

Why use NoSQL

In the case of RDBMS, making a choice is quite easy because all the databases like MySQL, Oracle, MS SQL, PostgreSQL in this category offer almost the same kind of solutions oriented toward ACID properties. When it comes to NoSQL, the decision becomes difficult because every NoSQL database offers different solutions and you have to understand which one is best suited for your app/system requirements. For example, MongoDB is fit for use cases where your system demands a schema-less document store. HBase might be fit for search engines, analyzing log data, or any place where scanning huge, two-dimensional join-less tables is a requirement. Redis is built to provide In-Memory search for varieties of data structures like trees, queues, linked lists, etc and can be a good fit for making real-time leaderboards, pub-sub kind of system. Similarly there are other databases in this category (Including Cassandra) which are fit for different problem statements. Now lets move to the original questions, and answer them one by one.

When to use Cassandra

Being a part of the NoSQL family, Cassandra offers a solution for problems where one of your requirements is to have a very heavy write system and you want to have a quite responsive reporting system on top of that stored data. Consider the use case of Web analytics where log data is stored for each request and you want to built an analytical platform around it to count hits per hour, by browser, by IP, etc in a real time manner. You can refer to this blog post to understand more about the use cases where Cassandra fits in.

When to Use a RDMS instead of Cassandra

Cassandra is based on a NoSQL database and does not provide ACID and relational data properties. If you have a strong requirement for ACID properties (for example Financial data), Cassandra would not be a fit in that case. Obviously, you can make a workaround for that, however you will end up writing lots of application code to simulate ACID properties and will lose on time to market badly. Also managing that kind of system with Cassandra would be complex and tedious for you.

When not to use Cassandra

I don't think it needs to be answered if the above explanation makes sense.


When evaluating distributed data systems, you have to consider the CAP theorem - you can pick two of the following: consistency, availability, and partition tolerance.

Cassandra is an available, partition-tolerant system that supports eventual consistency. For more information see this blog post I wrote: Visual Guide to NoSQL Systems.


Cassandra is the answer to a particular problem: What do you do when you have so much data that it does not fit on one server ? How do you store all your data on many servers and do not break your bank account and not make your developers insane ? Facebook gets 4 Terabyte of new compressed data EVERY DAY. And this number most likely will grow more than twice within a year.

If you do not have this much data or if you have millions to pay for Enterprise Oracle/DB2 cluster installation and specialists required to set it up and maintain it, then you are fine with SQL database.

However Facebook no longer uses cassandra and now uses MySQL almost exclusively moving the partitioning up in the application stack for faster performance and better control.


The general idea of NoSQL is that you should use whichever data store is the best fit for your application. If you have a table of financial data, use SQL. If you have objects that would require complex/slow queries to map to a relational schema, use an object or key/value store.

Of course just about any real world problem you run into is somewhere in between those two extremes and neither solution will be perfect. You need to consider the capabilities of each store and the consequences of using one over the other, which will be very much specific to the problem you are trying to solve.


Besides the answers given above about when to use and when not to use Cassandra, if you do decide to use Cassandra you may want to consider not using Cassandra itself, but one of the its many cousins out there.

Some answers above already pointed to various "NoSQL" systems which share many properties with Cassandra, with some small or large differences, and may be better than Cassandra itself for your specific needs.

Additionally, recently (several years after this question was originally asked), a Cassandra clone called Scylla (see https://en.wikipedia.org/wiki/Scylla_(database)) was released. Scylla is an open-source re-implementation of Cassandra in C++, which claims to have significantly higher throughput and lower latencies than the original Java Cassandra, while being mostly compatible with it (in features, APIs, and file formats). So if you're already considering Cassandra, you may want to consider Scylla as well.


I will focus here on some of the important aspects which can help you to decide if you really need Cassandra. The list is not exhaustive, just some of the points which I have at top of my mind-

  • Don't consider Cassandra as the first choice when you have a strict requirement on the relationship (across your dataset).

  • Cassandra by default is AP system (of CAP). But, it supports tunable consistency which means it can be configured to support as CP as well. So don't ignore it just because you read somewhere that it's AP and you are looking for CP systems. Cassandra is more accurately termed “tuneably consistent,” which means it allows you to easily decide the level of consistency you require, in balance with the level of availability.

  • Don't use Cassandra if your scale is not much or if you can deal with a non-distributed DB.

  • Think harder if your team thinks that all your problems will be solved if you use distributed DBs like Cassandra. To start with these DBs is very simple as it comes with many defaults but optimizing and mastering it for solving a specific problem would require a good (if not a lot) amount of engineering effort.

  • Cassandra is column-oriented but at the same time each row also has a unique key. So, it might be helpful to think of it as an indexed, row-oriented store. You can even use it as a document store.

  • Cassandra doesn't force you to define the fields beforehand. So, if you are in a startup mode or your features are evolving (as in agile) - Cassandra embraces it. So better, first think about queries and then think about data to answer them.

  • Cassandra is optimized for really high throughput on writes. If your use case is read-heavy (like cache) then Cassandra might not be an ideal choice.


Talking with someone in the midst of deploying Cassandra, it doesn't handle the many-to-many well. They are doing a hack job to do their initial testing. I spoke with a Cassandra consultant about this and he said he wouldn't recommend it if you had this problem set.