Why is it so bad to have large partitions in Cassandra?

Tags:

cassandra

I have seen this warning everywhere but cannot find any detailed explanation on this topic.

asked Sep 18 '17 by Glide

People also ask

What is partition size in Cassandra?

Partition size is measured by the number of cells (values) that are stored in the partition. Cassandra's hard limit is 2 billion cells per partition, but you'll likely run into performance issues before reaching that limit.
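As a rough back-of-the-envelope illustration (a sketch with a hypothetical table and made-up write rates, not from the original answer), a partition's cell count is roughly its row count times its number of regular, non-primary-key columns:

    -- Hypothetical table: one partition per sensor, one row per reading.
    CREATE TABLE readings (
        sensor_id   text,
        reading_ts  timestamp,
        temperature double,
        humidity    double,
        PRIMARY KEY ((sensor_id), reading_ts)
    );

    -- A sensor writing one row per second for a year accumulates roughly
    -- 31,536,000 rows x 2 regular columns ~= 63 million cells in a single
    -- partition: far below the 2 billion hard limit, but already well past
    -- the practical comfort zone discussed in the answer below.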

What is wide partition in Cassandra?

A partition in Cassandra represents a grouping of related rows. It is recommended to model your data so that related rows fall in the same partition; this is called the wide partition pattern. Lookups by partition key in Cassandra are very fast.
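As a minimal sketch of the wide partition pattern (table and column names are hypothetical, not from the original text), the grouping column goes in the partition key and the per-row identifier in a clustering column:

    -- All messages of one user land in the same (wide) partition,
    -- ordered within the partition by the clustering column sent_at.
    CREATE TABLE messages_by_user (
        user_id  uuid,
        sent_at  timeuuid,
        body     text,
        PRIMARY KEY ((user_id), sent_at)
    );

    -- Reading a user's messages is a single-partition lookup:
    SELECT sent_at, body FROM messages_by_user WHERE user_id = ?;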

What are partitions in Cassandra?

A partitioner is a function that hashes the partition key to generate a token. This token value is used to identify which token range, and therefore which node, the partition belongs to. A Cassandra client, however, sees the cluster as a single unified database and communicates with it through a Cassandra driver library.
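For illustration, CQL's built-in token() function shows the token the configured partitioner assigns to a partition key (the table here is the hypothetical messages_by_user sketched above):

    -- With the default Murmur3Partitioner this returns a signed 64-bit token;
    -- the token determines which replicas own the partition.
    SELECT token(user_id), user_id
    FROM messages_by_user
    WHERE user_id = 9b1deb4d-3b7d-4bad-9bdd-2b0d7b3dcb6d;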

How do I find a large partition in Cassandra?

The nodetool tablehistograms -- <keyspace> <table> command provides statistics about a table, including read/write latency, partition size, column count, and number of SSTables. For example, it can show that the 95th percentile partition size of a raw_data table is 107MB while the max is 3.44GB.


1 Answer

For starters

The maximum number of cells (rows x columns) in a single partition is 2 billion.

If you allow a partition to grow unbounded you will eventually hit this limitation.

Beyond that theoretical limit, there are practical limitations tied to the impact large partitions have on the JVM and on read times. These practical limits keep increasing from version to version, and they vary with data model, query patterns, heap size, and configuration, which makes it hard to give a straight answer on what's too large.

As of the 2.1 and early 3.0 releases, the primary cost on reads and compactions comes from deserializing the index, which marks a row every column_index_size_in_kb. You can increase key_cache_size_in_mb to prevent unnecessary deserialization on reads, but that reduces heap space and fills the old generation. You can increase the column index size, but that increases worst-case IO costs on reads. There are also many different CMS and G1 settings to tune away the impact of the huge spike in object allocations that occurs when reading these big partitions. There are active efforts to improve this, so in the future it might no longer be the bottleneck.

Repairs also only go down to (in the best case) the partition level. So if, say, you are constantly appending to a partition, and the hashes of that partition on two nodes are compared at slightly different times (a distributed system essentially guarantees this), the entire partition must be streamed over to ensure consistency. Incremental repairs can reduce the impact of this, but you are still streaming massive amounts of data and churning disk significantly, and the streamed data then needs to be compacted together unnecessarily.

You could keep adding corner cases and scenarios that cause issues. Large partitions are often still readable, but the tuning and edge cases involved are not really worth it; it is better to design the data model to be friendly with how Cassandra expects it (see the bucketing sketch below). I would recommend targeting 100MB, though you can go somewhat beyond that comfortably. Once you get into the GBs you will need to start considering tuning for it (depending on data model, use case, etc.).
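One common way to keep partitions bounded (a sketch, not part of the original answer; names are hypothetical) is bucketing: add a time bucket to the partition key so each partition only ever holds a bounded slice of the data:

    -- One partition per sensor per day instead of one partition per sensor.
    CREATE TABLE readings_by_day (
        sensor_id   text,
        day         date,
        reading_ts  timestamp,
        temperature double,
        PRIMARY KEY ((sensor_id, day), reading_ts)
    );

    -- Queries then always include the bucket:
    SELECT temperature FROM readings_by_day
    WHERE sensor_id = 'sensor-42' AND day = '2017-09-18';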

answered Sep 28 '22 by Chris Lohfink