Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How does the Leveled Compaction Strategy ensure 90% of reads are from one sstable

I am trying to understand how the Leveled Compaction Strategy in Cassandra works that guarantees 90% of all reads will be satisfied from a single sstable.

From DataStax Doc:

new sstables are added to the first level, L0, and immediately compacted with the sstables in L1. When L1 fills up, extra sstables are promoted to L2. Subsequent sstables generated in L1 will be compacted with the sstables in L2 with which they overlap.

like image 730
Chaity Avatar asked Apr 21 '15 08:04

Chaity


1 Answers

LeveledCompactionStrategy (LCS) in Cassandra implements the internals of LevelDB. You can check the exact implementation details in LevelDB implementation doc.

In order to give you a simple explanation take into account the following points:

  1. Every SSTable is created when a fixed (relatively small) size limit is reached. By default L0 gets 5MB files of files, and each subsequent level is 10x the size. (in L1 you'll have 50MB of data, L2 500MB, and so on).
  2. SSTables are created with the guarantee that they don't overlap
  3. When a level fills up, a compaction is triggered and stables from level-L are promoted to level-L+1. So, in L1 you'll have 50MB in ~10 files, L2 500MB in ~100 files, etc..

In bold are the relevant details that justify the 90% reads from the same file (SSTable). Let's do the math together and everything will become clearer.

Imagine you have keys A,B,C,D,E in L0, and each keys takes 1MB of data.

Next we insert key F. Because level 0 is filled a compaction will create a file with [A,B,C,D,E] in level 1, and F will remain in level 0.

That's ~83% of data in 1 file in L1.

Next we insert G,H,I,J and K. So L0 fills up again, L1 gets a new sstable with [I,G,H,I,J].

By now we have K in L0, [A,B,C,D,E] and [F,G,H,I,J] in L1

And that's ~90% of data in L1.

If we continue inserting keys we will get around the same behavior so, that's why you get 90% of reads served from roughly the same file/SSTable.

A more in depth and detailed (what happens with updates and tombstones) info is given in this paragraph on the link I mentioned (the sizes for compaction election are different because they are LevelDB defaults, not C*s):

When the size of level L exceeds its limit, we compact it in a background thread. The compaction picks a file from level L and all overlapping files from the next level L+1. Note that if a level-L file overlaps only part of a level-(L+1) file, the entire file at level-(L+1) is used as an input to the compaction and will be discarded after the compaction. Aside: because level-0 is special (files in it may overlap each other), we treat compactions from level-0 to level-1 specially: a level-0 compaction may pick more than one level-0 file in case some of these files overlap each other.

A compaction merges the contents of the picked files to produce a sequence of level-(L+1) files. We switch to producing a new level-(L+1) file after the current output file has reached the target file size (2MB). We also switch to a new output file when the key range of the current output file has grown enough to overlap more then ten level-(L+2) files. This last rule ensures that a later compaction of a level-(L+1) file will not pick up too much data from level-(L+2).

like image 77
Luís Correia Avatar answered Sep 23 '22 21:09

Luís Correia