Scylla/Cassandra: compaction strategy for time series data without TTL

We're actually using Scylla, but I included "Cassandra" in the title because the question should apply to it as well.

We have a time-series workload without TTL (data never expires). For context, the table schema looks roughly like this:

CREATE TABLE events (
    entity INT,
    time_partition TIMESTAMP,
    event_time TIMESTAMP,

    a TEXT,
    b TEXT,
    c TEXT,
    PRIMARY KEY((entity, time_partition), event_time)
);

Here time_partition is event_time truncated to the hour, so events for the same entity within the same hour go to the same partition.
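To make the partitioning rule concrete, here is a minimal sketch of how a client could derive time_partition from event_time before writing. The helper names (time_partition_for, partition_key) are hypothetical and not part of any driver API, and the sketch assumes timezone-aware timestamps that are truncated in UTC.

```python
from datetime import datetime, timezone


def time_partition_for(event_time: datetime) -> datetime:
    """Truncate a timezone-aware timestamp to the start of its hour, in UTC."""
    # Assumption: event_time is timezone-aware; naive values would be
    # interpreted as local time by astimezone().
    utc_time = event_time.astimezone(timezone.utc)
    return utc_time.replace(minute=0, second=0, microsecond=0)


def partition_key(entity: int, event_time: datetime) -> tuple[int, datetime]:
    """Build the (entity, time_partition) pair used as the partition key."""
    return (entity, time_partition_for(event_time))
```

Two events for the same entity at 04:10 and 04:59 map to the same key, while an event at 05:00 starts a new partition.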

I'm trying to figure out which compaction strategy we should use. At first I thought we should use TimeWindowCompactionStrategy (TWCS): with a 1-day time window we'd basically end up with per-day sstables, which makes sense to me. But then I found that various docs say TWCS isn't a good fit for never-expiring data. E.g. this doc on datastax.com says:

not appropriate for data without a TTL workload, as storage will grow without bound.

Storage will indeed grow "without bound" because we never want to delete old data, but isn't that the same for any other strategy? If we never delete data, it will keep growing. It's just a matter of how we organize this ever-growing dataset, and I'm not sure why TWCS isn't a good fit. Is the large number of files a problem? E.g. with a 1-day window, 10 years' worth of data would be roughly 3650 files.

Would appreciate any suggestions.

Dmitry Frank asked Oct 19 '25 04:10


1 Answer

In Scylla's case you'll have more files, since every shard (core) handles its own compaction. However, even if you end up with around 36k files after 10 years, that's not a problem with Scylla.
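As a rough back-of-the-envelope check on the file counts mentioned in this thread, the sketch below multiplies the number of time windows by the number of shards. It assumes each closed window ends up as exactly one sstable per shard, which is a simplification, and the shard count of 10 is a hypothetical value picked only because it lands near the 36k figure. The function name is illustrative.

```python
def estimated_sstable_count(years: float, window_days: float = 1.0, shards: int = 1) -> int:
    """Estimate sstable count, assuming one sstable per window per shard."""
    # Simplification: ignores leap days, the still-active window, and any
    # windows that have not been fully compacted yet.
    windows = round(years * 365 / window_days)
    return windows * shards


# Single compaction domain, 1-day windows, 10 years of data.
per_node = estimated_sstable_count(10)

# Hypothetical node with 10 shards, each compacting independently.
per_node_sharded = estimated_sstable_count(10, shards=10)
```

Under these assumptions the first call gives the 3650 from the question and the second gives 36500, in line with the "36k" order of magnitude.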

dor laor answered Oct 22 '25 08:10