Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

When does Cassandra hit Amdahl's law?

Tags:

cassandra

I am trying to understand the claims that Cassandra scales linearly with the number of nodes. In a quick look around the 'net I have not seen much of a treatment of this topic. Surely there are serial processing elements in Cassandra that must limit the speed gained as N increases. Any thoughts, pointers or links on this subject would be appreciated.

Edit to provide perspective:
I am working on a project that has a current request for a 1,000+ node Cassandra infrastructure. I did not come-up with this spec. I find myself proposing that N be reduced to a range between 200 and 500, with each node being at least twice as fast for serial computation. This is easy to achieve without a cost penalty per node by making simple changes to the server configuration.

like image 748
martin's Avatar asked Jan 12 '12 17:01

martin's


1 Answers

Cassandra's scaling is better described in terms of Gustafson's law, rather than Amdahl's law. Gustafson scaling looks at how much more data you can process as the number of nodes increases. That is, if you have N times as many nodes, you can process a dataset N times larger in the same amount of time.

This is possible because Cassandra uses very little cluster-wide coordination, except for schema and ring changes. Most operations only involve a number of nodes equal to the replication factor, which stays constant as the dataset grows -- hence nearly linear scale out.

By contrast, Amdahl scaling looks at how much faster you can process a fixed dataset as the number of nodes increases. That is, if you have N times as many nodes, can you process the same dataset N times faster?

Clearly, at some point you reach a limit where adding more nodes doesn't make your requests any faster, because there is a minimum amount of time needed to service a request. Cassandra is not linear here.

In your case, it sounds like you're asking whether it's better to have 1,000 slow nodes or 200 fast ones. How big is your dataset? It depends on your workload, but the usual recommendation is that the optimal size of nodes is around 1TB of data each, making sure you have enough RAM and CPU to match (see cassandra node limitations). 1,000 sounds like far too many, unless you have petabytes of data.

like image 128
Theodore Hong Avatar answered Sep 22 '22 02:09

Theodore Hong