
How to configure Cassandra TimeWindowCompactionStrategy

My time series data expires via TTL after 1 to 7 days, depending on the use case. The data is immutable and clustered by timestamp. Timestamps are assigned on write, so new data should always have increasing timestamps.

A partition should not exceed 10K items, and usually holds far fewer (at most ~10 MB for a full 10K items).

I couldn't find good documentation on how the compaction strategy should be configured (which parameters to take into account), so I just decided to do it like this:

compaction = {'class': 'TimeWindowCompactionStrategy', 'compaction_window_size': '7', 'compaction_window_unit': 'DAYS'}

I'm not at all sure that this is correct.

Which KPIs should I be taking into account?

Avner Barr asked Oct 09 '18 08:10




1 Answer

There is no single right answer:

With your configuration, all data written within the same 7-day window is compacted together. The biggest advantage of TWCS is that it can drop entire SSTables without even reading them, because it knows that all the data inside the SSTable has already expired.

In this case, the data you wrote with a 1-day TTL cannot be dropped when it expires, because it is lumped into a 7-day window with everything else. In the worst case, the SSTable holds a mutation with a 7-day TTL that was inserted at the very end of the 7-day window, so the entire SSTable is kept around for 7 more days until that one mutation expires.

This sounds suboptimal, but at least you can serve all reads for data in that window from a single SSTable. Going the other way, you could set the window to, for instance, one day. Your data would then expire a lot faster, but for data that stays alive for 7 days you would be touching 7 SSTables instead of one.

Summary:

Larger time windows: slower expiration, faster reads for live data.
Smaller time windows: faster expiration, slower reads for live data.
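To put rough numbers on this trade-off, here is a minimal Python sketch. It is a simplified model and not Cassandra's actual logic: the function names are made up for illustration, it assumes one SSTable per completed window, and it ignores gc_grace_seconds and window alignment (a misaligned range can touch one extra window).

```python
import math


def worst_case_days_on_disk(ttl_days: float, window_days: float) -> float:
    """Longest a row can occupy disk in this simplified model.

    A row written at the very start of a window shares its SSTable with
    rows written at the very end of that window. The SSTable can only be
    dropped once the last of those rows has expired, which is
    window_days + ttl_days after the first row was written.
    """
    return window_days + ttl_days


def windows_with_live_data(ttl_days: float, window_days: float) -> int:
    """Approximate number of time windows (SSTables) that hold live data."""
    return math.ceil(ttl_days / window_days)


# Compare a few candidate window sizes for data with a 7-day TTL.
for window in (1, 2, 3, 7):
    on_disk = worst_case_days_on_disk(ttl_days=7, window_days=window)
    sstables = windows_with_live_data(ttl_days=7, window_days=window)
    print(f"window={window}d  worst-case days on disk={on_disk}  live windows~{sstables}")
```

The numbers are only an upper-bound estimate: a read for a single partition touches just the SSTables that actually contain that partition, which may be fewer than the number of live windows.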

As with most things in life, the truth is in the middle. Both options would work, and now that you understand the trade-offs, the best window is probably somewhere between 1 and 7 days.

Glauber Costa answered Sep 23 '22 17:09