
How does Cassandra (or Scylla) sort clustering columns?

One of the benefits of Cassandra (or Scylla) is that:

"When a table has multiple clustering columns, the data is stored in nested sort order." (source: https://docs.datastax.com/en/dse/6.0/cql/cql/cql_using/whereClustering.html)

Because of this I think reading the data back in that same sorted order should be very fast.

If data is written in a different order than the clustering columns specify, when does Cassandra (or Scylla) actually re-order the data?

Is it when the memtables are flushed to SSTables?

What if a memtable has already been flushed, and I then add a new record that should sort before records in an existing SSTable?

Does it keep the data out of order on disk for a while and re-order it during compaction?

If so, what steps does it take to make sure reads are in the correct order?

Drew LeSueur asked Jan 01 '23 16:01


1 Answer

Data is always sorted in any given sstable.

When a memtable is flushed to disk, a new sstable is created, and that sstable is internally sorted. This happens naturally: memtables already store data in sorted order, so no extra sorting is needed at flush time. The sorting happens on insertion into the memtable.
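To make the "sorted on insertion, flushed as-is" idea concrete, here is a minimal toy sketch. The `Memtable` class, its methods, and the tuple-shaped clustering keys are all hypothetical names for illustration; real memtables in Cassandra and Scylla use different data structures, so treat this only as a model of the ordering behaviour.

```python
import bisect


class Memtable:
    """Toy memtable (hypothetical): rows are kept sorted by clustering key on insert."""

    def __init__(self):
        self._keys = []   # clustering keys, always kept in sorted order
        self._rows = {}   # clustering key -> value

    def insert(self, clustering_key, value):
        # The sorting work is done here, at write time.
        if clustering_key not in self._rows:
            bisect.insort(self._keys, clustering_key)
        self._rows[clustering_key] = value

    def flush(self):
        # Flushing only walks the already-sorted keys; no sort is needed.
        sstable = tuple((k, self._rows[k]) for k in self._keys)
        self._keys, self._rows = [], {}
        return sstable


memtable = Memtable()
# Writes arrive in an order unrelated to the clustering order.
for key, value in [((2023, 5), "c"), ((2021, 9), "a"), ((2022, 1), "b")]:
    memtable.insert(key, value)

sstable = memtable.flush()
print(sstable)
```

Tuples compare element by element in Python, which roughly mirrors the "nested sort order" of multiple clustering columns.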

A read in the natural (clustering) order has to consult every sstable relevant to the query, along with any live memtable, and merge those individually sorted results into one sorted result. This merge happens in memory, on the fly.
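The read-time merge can be sketched roughly as below. This assumes each source is a list of hypothetical `(clustering_key, write_timestamp, value)` rows already sorted by key, and that the newest timestamp wins when the same key appears more than once. `merged_read` is an illustrative name, not an actual Cassandra or Scylla function.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def merged_read(*sources):
    """Lazily merge sorted sources; for duplicate keys, keep the newest write."""
    # heapq.merge streams the inputs in key order without loading them fully.
    merged = heapq.merge(*sources, key=itemgetter(0))
    for key, versions in groupby(merged, key=itemgetter(0)):
        _, _, value = max(versions, key=itemgetter(1))  # highest timestamp wins
        yield key, value


# The newer sstable holds key 5, which sorts before rows in the older sstable.
older_sstable = [(10, 1, "j"), (30, 1, "old")]
newer_sstable = [(5, 2, "e"), (30, 2, "new")]

rows = list(merged_read(older_sstable, newer_sstable))
print(rows)
```

This is why a late write that sorts "before" existing data does not force a rewrite on disk: it lives in a separate sorted run, and the merge puts it in the right place at read time.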

Compaction, when it kicks in, replaces multiple sstables with one. It does so by producing a merged stream, much as a regular read does, and writing that stream out as a new sstable.
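As a rough illustration of compaction being "a merge that gets written back", here is a small sketch under the same assumptions: rows are hypothetical `(clustering_key, write_timestamp, value)` tuples and the newest timestamp wins. The `compact` function is made up for this example, and real compaction strategies also handle tombstones, TTLs and sstable selection, none of which is modelled here.

```python
import heapq
from operator import itemgetter


def compact(sstables):
    """Merge several sorted sstables into one sorted sstable (toy model)."""
    result = []
    for row in heapq.merge(*sstables, key=itemgetter(0)):
        if result and result[-1][0] == row[0]:
            # Same clustering key seen again: keep only the newest version.
            if row[1] > result[-1][1]:
                result[-1] = row
        else:
            result.append(row)
    return result


sstables = [
    [(10, 1, "j"), (30, 1, "old")],
    [(5, 2, "e"), (30, 2, "new")],
]
compacted = compact(sstables)
print(compacted)
```

The output is itself a sorted run, so after compaction a read has fewer sources to merge.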

This technique of storing data is known as a log-structured merge tree (LSM tree).

Tomek answered Jan 05 '23 11:01