Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What is the effect of number of levels in levelled compaction?

I know how levelled compaction works in DBS like Cassandra, rocksdb etc. Some have max number of levels 4 and some have 7. How does this number affect compaction process? Why can't I have just 2 levels, 1st one which has flushed mem-table data (overlap possible between files) and 2nd one which contains nonoverlapping SSTs?

If there is any doc or duplicate question, please redirect.

Edit-1: Duplicate data increases when the number of levels goes up.

like image 448
Bishnu Avatar asked Jan 27 '20 16:01

Bishnu


People also ask

How does level compaction work?

Leveled compaction creates sstables of a fixed, relatively small size (5MB by default in Cassandra's implementation), that are grouped into "levels." Within each level, sstables are guaranteed to be non-overlapping. Each level is ten times as large as the previous.

What is LSM compaction?

Leveled CompactionAn LSM-Tree consists of multiple levels. The size of each level is maintained at T times that of the previous level. When the size ratio of each adjacent level pair is the same, the write amplification is minimal.

What is Rocksdb compaction?

Compaction merges all sorted runs in one level to create a new sorted run in the next level. N in this case is similar to fanout for leveled compaction. Compaction does not read/rewrite sorted runs in Ln when merging into Ln.


2 Answers

LCS comes to solves STCS’s space-amplification problem. It also reduces read amplification (the average number of disk reads needed per read request).

Leveled compaction divides the small sstables (“fragments”) into levels:

Level 0 (L0) is the new sstables, recently flushed from memtables. As their number grows (and reads slow down), our goal is to move sstables out of this level to the next levels. Each of the other levels, L1, L2, L3, etc., is a single run of an exponentially increasing size: L1 is a run of 10 sstables, L2 is a run of 100 sstables, L3 is a run of 1000 sstables, and so on. (Factor 10 is the default setting in both Scylla and Apache Cassandra).

While solving, or at least significantly improving, the space amplification problem, LCS makes another problem, write amplification, worse.

"Write amplification” is the amount of bytes we had to write to the disk for each one byte of newly flushed sstable data. Write amplification is always higher than 1.0 because we write each piece of data to the commit-log, and then write it again to an sstable, and then each time compaction involves this piece of data and copies it to a new sstable, that’s another write.

Read more about it here:

  • https://www.scylladb.com/2018/01/31/compaction-series-leveled-compaction/
  • https://docs.scylladb.com/kb/compaction/
  • https://docs.scylladb.com/architecture/compaction/compaction-strategies/
like image 79
TomerSan Avatar answered Sep 30 '22 04:09

TomerSan


Leveled compaction works Scylla very similarly to how it works in Cassandra and Rocksdb (with some small differences). If you want a short overview on how leveled compaction works in Scylla and why, I suggest that you read my blog post https://www.scylladb.com/2018/01/31/compaction-series-leveled-compaction/.

Your specific question on why two levels (L0 of recently flushed sstables, Ln of disjoint-range sstables) are not enough - is a very good question:

The main problem is that a single flushed memtable (sstable in L0), containing a random collection of writes, will often intersect all of the sstables in Ln. This means rewriting the entire database every time there's a new memtable flushed, and the result is a super-huge amount of write amplification, which is completely unacceptable.

One way to reduce this write amplification significantly (but perhaps not enough) is to introduce a cascade of intermediate levels, L0, L1, ..., Ln. The end result is that we have L(n-1) which is 1/10th (say) the size of Ln, and we merge L(n-1) - not a single sstable - into Ln. This is the approach that leveled compaction strategy (LCS) uses in all systems you mentioned.

A completely different approach could be not to merge a single sstable into Ln, but rather try to collect a large amount of data first, and only then merge it into Ln. We can't just collect 1,000 tables in L0 because this would make reads very slow. Rather, to collect this large amount of data, one could use size-tiered compaction (STCS) inside L0. In other words, this approach is a "mix" of STCS and LCS with two "levels": L0 uses STCS on new sstables, Ln contains a run of sstables (sstables with disjoint ranges). When L0 reaches 1/10th (say) the size of Ln, L0 is compacted into Ln. Such a mixed approach could have lower write amplification than LCS, but because most of the data is in a run in Ln, it would have same low space and read amplifications as in LCS. None of the mentioned databases (Scylla, Cassandra, or Rocksdb) has such "mixed" compaction supported, as far as I know.

like image 33
Nadav Har'El Avatar answered Sep 30 '22 04:09

Nadav Har'El