
Cassandra control SSTable size

Is there a way to control the maximum size of an SSTable? For example, with a limit of 100 MB, could Cassandra start a new SSTable once a column family (CF) holds more than 100 MB of data?

RRM asked Apr 01 '15 13:04




1 Answer

Unfortunately the answer is not so simple. The sizes of your SSTables are influenced by your compaction strategy, and there is no direct way to set a maximum SSTable size.

SSTables are first created when memtables are flushed to disk. Their initial size depends on your memtable settings and the size of your heap (memtable_total_space_in_mb being a large influence). Typically these freshly flushed SSTables are pretty small. They are later merged together by a process called compaction.

If you use Size-Tiered Compaction Strategy (STCS), you can end up with really large SSTables. STCS runs a minor compaction when there are at least min_threshold (default 4) SSTables of similar size: it combines them into one file, merging keys and dropping expired data. Because each merged file can itself be merged again with others of its size, this can produce very large SSTables after a while.
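To see why STCS sizes grow, here is a toy Python sketch of the merge rule described above. It is a simplification, not Cassandra's actual algorithm: it assumes every flush produces an SSTable of exactly `flush_mb` (a made-up figure), merges only tables of identical size (real STCS groups tables of similar size into buckets), and assumes no data is dropped during a merge.

```python
from collections import Counter


def simulate_stcs(flushes, flush_mb=10, min_threshold=4):
    """Toy model of size-tiered merging; returns the sorted SSTable sizes in MB.

    Assumptions (not Cassandra's real behaviour): every flush is exactly
    flush_mb, only identical sizes merge, and merging never shrinks data.
    """
    sizes = Counter()  # size in MB -> number of SSTables of that size
    for _ in range(flushes):
        sizes[flush_mb] += 1
        size = flush_mb
        # A merge can create enough tables of the next size to merge again.
        while sizes[size] >= min_threshold:
            sizes[size] -= min_threshold
            merged = size * min_threshold
            sizes[merged] += 1
            size = merged
    return sorted(s for s, count in sizes.items() for _ in range(count))


print(simulate_stcs(21))  # [10, 40, 160]
print(simulate_stcs(64))  # [640]
```

Under these assumptions, 64 flushes of 10 MB end up in a single 640 MB file, which illustrates why STCS gives you no upper bound on SSTable size.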

With Leveled Compaction Strategy (LCS) there is an sstable_size_in_mb option that sets a target size for SSTables. In general SSTables will be less than or equal to this size, unless a single partition holds a lot of data ('wide rows'), since a partition is not split across SSTables within a level.
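As a sketch of how the LCS option could be applied for the 100 MB case in the question, the Python helper below builds the CQL statement as a string. The helper name and the keyspace and table names are hypothetical placeholders; the resulting statement would be run through cqlsh or a driver, and 100 is a target size, not a hard cap.

```python
def lcs_alter_statement(keyspace, table, sstable_size_mb):
    """Build a CQL statement that switches a table to LCS with a target size.

    keyspace and table are placeholders supplied by the caller; this helper
    only formats text and does not talk to a cluster.
    """
    if sstable_size_mb <= 0:
        raise ValueError("sstable_size_mb must be positive")
    return (
        f"ALTER TABLE {keyspace}.{table} WITH compaction = "
        f"{{'class': 'LeveledCompactionStrategy', "
        f"'sstable_size_in_mb': {sstable_size_mb}}};"
    )


# Hypothetical names, used only for illustration.
print(lcs_alter_statement("my_keyspace", "my_table", 100))
```

Changing the compaction strategy on an existing table causes its data to be recompacted, so it is worth trying on a test cluster first.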

I haven't experimented much with Date-Tiered Compaction Strategy (DTCS) yet. It works similarly to STCS in that it merges SSTables of similar size, but it keeps data together in time order, and it has a setting to stop compacting old data (max_sstable_age_days), which could be interesting.

The key is to find the compaction strategy which works best for your data and then tune the properties around what works best for your data model / environment.

You can read more about the configuration settings for compaction here and read this guide to help understand whether STCS or LCS is appropriate for you.

Andy Tolbert answered Oct 07 '22 15:10