
Why is it so bad to have large partitions in Cassandra?



I have seen this warning everywhere but cannot find any detailed explanation on this topic.

Glide asked Sep 18 '17 06:09


People also ask

What is partition size in Cassandra?

Partition size is measured by the number of cells (values) that are stored in the partition. Cassandra's hard limit is 2 billion cells per partition, but you'll likely run into performance issues before reaching that limit.

What is wide partition in Cassandra?

A partition in Cassandra represents a grouping of similar rows. It is recommended to model your data so that similar rows fall in the same partition. This is called the wide partition pattern. Lookups by partition key are very fast in Cassandra.

What are partitions in Cassandra?

A partitioner is a function that hashes the partition key to generate a token. This token value represents a row and is used to identify the partition range it belongs to in a node. However, a Cassandra client sees the cluster as a unified whole database and communicates with it using a Cassandra driver library.

How do I find a large partition in Cassandra?

Try nodetool tablehistograms -- <keyspace> <table>. The command provides statistics about a table, including read/write latency, partition size, column count, and number of SSTables. For example, it might show that the 95th percentile partition size of a raw_data table is 107MB, with a maximum of 3.44GB.

1 Answer

For starters

The maximum number of cells (rows x columns) in a single partition is 2 billion.

If you allow a partition to grow unbounded you will eventually hit this limitation.

Beyond that theoretical limit, there are practical limits tied to the impact large partitions have on the JVM and on read times. These practical limits rise from version to version. They are not fixed: they vary with data model, query patterns, heap size, and configuration, which makes it hard to give a straight answer on what is too large.

As of 2.1 and early 3.0 releases, the primary cost on reads and compactions comes from deserializing the index, which marks a row every column_index_size_in_kb. You can increase key_cache_size_in_mb so that reads avoid unnecessary deserialization, but that reduces available heap space and fills the old gen. You can increase the column index size, but that increases worst-case IO costs on reads. There are also many different settings for CMS and G1 to tune the impact of the huge spike in object allocations when reading these big partitions. There are active efforts to improve this, so in the future it might no longer be the bottleneck.
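
As a rough illustration of why the index grows with the partition, the sketch below estimates the number of index entries as partition size divided by the index interval. This is a simplification, not Cassandra's actual code: the function name is hypothetical, and the default of 64 for column_index_size_in_kb is an assumption you should check against your own cassandra.yaml.

```python
import math

MIB = 1024 * 1024


def estimated_index_entries(partition_bytes: int, column_index_size_kb: int = 64) -> int:
    """Approximate the number of index entries for one partition.

    Assumes one entry per `column_index_size_kb` of partition data
    (64 is an assumed default, not read from any real configuration).
    """
    return math.ceil(partition_bytes / (column_index_size_kb * 1024))


if __name__ == "__main__":
    for size_mib in (100, 1024):
        for interval_kb in (64, 128):
            entries = estimated_index_entries(size_mib * MIB, interval_kb)
            print(f"{size_mib} MiB partition, {interval_kb} KB interval: {entries} entries")
```

Doubling the interval halves the number of entries to deserialize, but each lookup may then have to scan up to twice as much data after the nearest index mark, which is the worst-case IO trade-off mentioned above.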

Repairs also only go down to the partition level (in the best case). So if, say, you are constantly appending to a partition, and hashes of that partition on 2 nodes are compared at slightly different times (a distributed system essentially guarantees this), the entire partition must be streamed over to ensure consistency. Incremental repairs can reduce the impact of this, but you're still streaming massive amounts of data and causing significant swings in disk usage, and that data will then need to be compacted together unnecessarily.
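
The repair cost can be shown with a toy model. This is not Cassandra's repair implementation (real repair compares Merkle trees built over token ranges); it only mimics the best-case behaviour described above, where the smallest comparable unit is a whole partition. All names here are hypothetical.

```python
import hashlib
from typing import Sequence


def partition_digest(rows: Sequence[bytes]) -> str:
    """Hash every row of a partition into a single digest."""
    digest = hashlib.sha256()
    for row in rows:
        digest.update(row)
    return digest.hexdigest()


def bytes_to_stream(source: Sequence[bytes], target: Sequence[bytes]) -> int:
    """Return how many bytes must be sent from source to target.

    If the digests differ at all, the whole source partition is streamed,
    because the comparison cannot tell which rows are different.
    """
    if partition_digest(source) == partition_digest(target):
        return 0
    return sum(len(row) for row in source)


if __name__ == "__main__":
    # Replica A has seen one more 100-byte append than replica B.
    replica_a = [b"x" * 100 for _ in range(1000)]
    replica_b = replica_a[:-1]
    print(bytes_to_stream(replica_a, replica_b), "bytes streamed for a 100-byte difference")
```

In this model a single missing 100-byte row causes the full 100,000-byte partition to be streamed, and the waste scales with partition size.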

You can probably keep adding corner cases and scenarios that have issues to this list. Large partitions are often possible to read, but the tuning and corner cases involved are not really worth it; it is better to design the data model to fit how Cassandra expects data to be laid out. I would recommend targeting 100MB, but you can go far beyond that comfortably. Once you get into the GBs you will need to start considering tuning for it (depending on data model, use case, etc.).
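
One common way to keep partitions bounded, which the answer above does not spell out, is to add a time bucket to the partition key. The sketch below assumes a hypothetical time-series workload keyed by sensor_id with one partition per sensor per UTC day; the names, row size, and write rate are illustrative assumptions only.

```python
from datetime import datetime, timezone
from typing import Tuple

TARGET_PARTITION_BYTES = 100 * 1024 * 1024  # the ~100MB target suggested above


def day_bucket(ts: datetime) -> str:
    """Map a timezone-aware timestamp to a UTC day bucket such as '2017-09-18'."""
    return ts.astimezone(timezone.utc).strftime("%Y-%m-%d")


def partition_key(sensor_id: str, ts: datetime) -> Tuple[str, str]:
    """Composite partition key: one partition per sensor per day (hypothetical schema)."""
    return (sensor_id, day_bucket(ts))


def estimated_partition_bytes(rows_per_bucket: int, avg_row_bytes: int) -> int:
    """Rough size estimate for one bucket; ignores storage overhead and compression."""
    return rows_per_bucket * avg_row_bytes


if __name__ == "__main__":
    ts = datetime(2017, 9, 18, 6, 9, tzinfo=timezone.utc)
    print(partition_key("sensor-1", ts))
    # Assumed workload: one 200-byte row per second, so 86,400 rows per day.
    size = estimated_partition_bytes(86_400, 200)
    print(size, "bytes per daily partition; under target:", size <= TARGET_PARTITION_BYTES)
```

If the estimate comes out above the target, a smaller bucket (for example, per hour) would shrink each partition at the cost of reading more partitions per query.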

Chris Lohfink answered Sep 28 '22 04:09
