 

How does Cassandra scale horizontally?

I've watched a video on the Cassandra database, which turned out to be very effective and really explained a lot about Cassandra. I've also read some articles and books about Cassandra, but the thing I could not understand is how Cassandra scales horizontally. By horizontal scaling I mean adding more nodes to gain more space. As I understand it, each node holds identical data, i.e. if one node has 1TB of data and it is replicated to the other nodes, this means all n nodes will each contain 1TB of data. Am I missing something here?

asked Jul 27 '15 by Adelin


People also ask

How is Cassandra scalable?

A single Cassandra instance is called a node. Cassandra supports horizontal scalability achieved by adding more than one node as a part of a Cassandra cluster. The scalability works with linear performance improvement if the resources are configured optimally.

What is linear scalability in Cassandra?

Yes, Cassandra has linear scalability. In one benchmark, each client system generated about 17,500 write requests per second, and there were no bottlenecks as the traffic was scaled up. Each client ran 200 threads to generate traffic across the cluster.

Which topology is used in Cassandra?

Cassandra has a ring-type architecture. Cassandra has no master nodes and no single point of failure. Cassandra supports network topology with multiple data centers, multiple racks, and nodes.
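
Replication across those data centers is configured per keyspace with NetworkTopologyStrategy. Below is a minimal sketch using the Python cassandra-driver; the contact point, the keyspace name demo, and the data center names dc1/dc2 are placeholders and must match what your cluster's snitch reports:

```python
from cassandra.cluster import Cluster

# Connect to any reachable node in the cluster (address is a placeholder).
cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

# Keep 3 replicas in dc1 and 2 in dc2; the data center names
# must match those reported by the cluster's snitch.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {
        'class': 'NetworkTopologyStrategy',
        'dc1': 3,
        'dc2': 2
    }
""")
```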

How does Cassandra provide high write throughput?

Writes go to an in-memory data structure (the Memtable), and writing to memory is much faster than writing to disk; this is why Cassandra writes are extremely fast. For durability, each write is also appended to a sequential commit log on disk, and the write is replicated to multiple other nodes, so if one node loses its Memtable data, there are mechanisms in place for eventual consistency.
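
As a rough illustration only (a toy model, not Cassandra's actual implementation), that write path can be sketched as an append-only log plus an in-memory table that is periodically flushed to an immutable SSTable-like file:

```python
import json

class ToyWriteStore:
    """Toy model of Cassandra's write path: append each mutation to a
    commit log, update an in-memory Memtable, and periodically flush
    the Memtable to an immutable SSTable-like file."""

    def __init__(self, commitlog_path, flush_threshold=1000):
        self.commitlog = open(commitlog_path, "a")  # sequential append: cheap
        self.memtable = {}                          # in-memory: fastest
        self.flush_threshold = flush_threshold
        self.sstable_count = 0

    def write(self, key, value):
        # 1. Durability: append the mutation to the commit log.
        self.commitlog.write(json.dumps({"k": key, "v": value}) + "\n")
        self.commitlog.flush()
        # 2. Speed: update the Memtable; no random disk I/O on the write path.
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            self._flush()

    def _flush(self):
        # Write the sorted Memtable to an immutable file, then start fresh.
        path = f"sstable-{self.sstable_count}.json"
        with open(path, "w") as f:
            json.dump(dict(sorted(self.memtable.items())), f)
        self.sstable_count += 1
        self.memtable.clear()
```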


3 Answers

Yes, you are missing something. Data does not have to be duplicated n times, where n is the number of nodes. You would typically configure your replication factor (RF) to be lower than the number of nodes (N).

For example, RF = 3, N = 5. Each row is then stored on 3 of the 5 nodes, with the replicas chosen deterministically by the partitioner from the token ring. Note that RF counts the total number of copies; there is no extra "pristine" copy on top. If one of those nodes goes down, you still have 2 copies elsewhere on the other nodes.

This works better in larger clusters, e.g. RF = 5, N = 100.

A higher RF improves data redundancy and read throughput, but decreases your write speed, so there is a balance. If your RF is very high, like RF = N, you'd have very high data redundancy, high resilience to node failures, and high read throughput. On the other hand, your write throughput will be very limited, as every write needs to be replicated to all the nodes. If one node goes down in this scenario, writes may fail (depending on the consistency level the client requests), because the required number of replicas cannot be reached.
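
To make the placement concrete, here is a minimal sketch of SimpleStrategy-style placement, using md5 as a stand-in for Cassandra's real Murmur3 partitioner and hypothetical node names: hash the partition key onto the token ring, then walk clockwise until RF distinct nodes are collected.

```python
import hashlib
from bisect import bisect_right

def replicas_for(key, ring, rf):
    """Return the rf nodes owning `key`, SimpleStrategy-style:
    find the first token >= hash(key), then walk the ring clockwise
    until rf distinct nodes are collected.
    ring: sorted list of (token, node_name) pairs."""
    token = int(hashlib.md5(key.encode()).hexdigest(), 16)  # stand-in partitioner
    tokens = [t for t, _ in ring]
    i = bisect_right(tokens, token)          # wraps past the end via modulo below
    replicas = []
    while len(replicas) < rf:
        node = ring[i % len(ring)][1]
        if node not in replicas:             # one copy per physical node
            replicas.append(node)
        i += 1
    return replicas

# A 5-node cluster with one token per node and RF = 3:
# every row lands on exactly 3 of the 5 nodes.
nodes = ["node1", "node2", "node3", "node4", "node5"]
ring = sorted((int(hashlib.md5(n.encode()).hexdigest(), 16), n) for n in nodes)
for key in ["user:1", "user:2", "user:3"]:
    print(key, "->", replicas_for(key, ring, rf=3))
```

Running this for different keys shows each row landing on 3 of the 5 nodes, so the total data set is spread across the cluster rather than copied everywhere.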

answered Sep 23 '22 by oleksii


The number of replicas (i.e. identical copies of the data) stored for each partition (row/piece of data) is configurable. So if you have n nodes, you could in theory set the database to replicate each partition n times; in that case adding more nodes would not gain you any space. However, if you set the number of replicas to 1 or 2, each node only has to store a fraction of the data, and new data can then go onto new nodes. Keep in mind, though, that with fewer replicas you have a greater chance of losing data if several nodes go down at the same time.
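
Replication is configured per keyspace and can be changed later. A minimal sketch using the Python cassandra-driver, reusing the hypothetical keyspace demo from the earlier sketch; after raising the RF you would run nodetool repair so the new replicas are actually populated:

```python
from cassandra.cluster import Cluster

# Contact point is a placeholder for any reachable node.
session = Cluster(["127.0.0.1"]).connect()

# Raise the keyspace's replication factor from 2 to 3. Afterwards,
# run `nodetool repair demo` so the new third replica of each
# partition is actually streamed to its node.
session.execute("""
    ALTER KEYSPACE demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
""")
```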

answered Sep 21 '22 by Rdesmond


As I understand it, each node holds identical data, i.e. if one node has 1TB of data and it is replicated to the other nodes, this means all n nodes will each contain 1TB of data. Am I missing something here?

Yes, not all nodes are necessarily copies of each other. Depending on the level of availability I want to support, I can set my replication factor lower than the total number of nodes.

Let's say that I have a 2 node cluster with a replication factor of 2. So in this case, each node does have a complete copy of the data. If I am running out of disk, I can alleviate some of that by adding a new node while keeping my replication factor set at 2 (3 nodes, RF of 2).

In this way if each disk has 1TB of storage, and I'm at 900GB on each, adding a new node (while keeping my RF the same) makes each node responsible for only 2/3 of the data. So in this case, each node would hold 600GB of data (freeing up 300GB on my 2 existing nodes). And thus, I have increased my disk capacity by scaling horizontally.
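
The arithmetic generalizes: with RF copies of the data spread evenly over N nodes, each node holds roughly RF/N of the unique data set. A quick check of the numbers above (a hypothetical helper, not a Cassandra API):

```python
def per_node_gb(unique_gb, rf, n):
    """Storage each node holds when rf copies of unique_gb of data
    are spread evenly over n nodes."""
    return unique_gb * rf / n

print(per_node_gb(900, rf=2, n=2))  # 900.0 GB: two nodes, each a full copy
print(per_node_gb(900, rf=2, n=3))  # 600.0 GB: third node added, RF unchanged
```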

The catch is that even though I have 3 nodes, I can really only afford to lose one of them. If I lose two nodes, then I can't serve my queries.

answered Sep 19 '22 by Aaron