I have a single node cassandra cluster, I use the current minute as partition key and insert rows with TTL of 12 hours.
I see a couple of issue I can't explain
/var/lib/cassandra/data/<key_space>/<table_name>
contains multiple files, lots of them are really old (way older then 12 hours, something like 2 days)log:
WARN [SharedPool-Worker-2] 2015-01-26 10:51:39,376 SliceQueryFilter.java:236 - Read 0 live and 1571042 tombstoned cells in <table_name>_name (see tombstone_warn_threshold). 100 columns was requested, slices=[-], delInfo={deletedAt=-9223372036854775808, localDeletion=2147483647}
WARN [SharedPool-Worker-2] 2015-01-26 10:51:40,472 SliceQueryFilter.java:236 - Read 0 live and 1557919 tombstoned cells in <table_name> (see tombstone_warn_threshold). 100 columns was requested, slices=[-], delInfo={deletedAt=-9223372036854775808, localDeletion=2147483647}
WARN [SharedPool-Worker-2] 2015-01-26 10:51:41,630 SliceQueryFilter.java:236 - Read 0 live and 1589764 tombstoned cells in <table_name> (see tombstone_warn_threshold). 100 columns was requested, slices=[-], delInfo={deletedAt=-9223372036854775808, localDeletion=2147483647}
WARN [SharedPool-Worker-2] 2015-01-26 10:51:42,877 SliceQueryFilter.java:236 - Read 0 live and 1582163 tombstoned cells in <table_name> (see tombstone_warn_threshold). 100 columns was requested, slices=[-], delInfo={deletedAt=-9223372036854775808, localDeletion=2147483647}
WARN [SharedPool-Worker-2] 2015-01-26 10:51:44,081 SliceQueryFilter.java:236 - Read 0 live and 1550989 tombstoned cells in <table_name> (see tombstone_warn_threshold). 100 columns was requested, slices=[-], delInfo={deletedAt=-9223372036854775808, localDeletion=2147483647}
WARN [SharedPool-Worker-2] 2015-01-26 10:51:44,869 SliceQueryFilter.java:236 - Read 0 live and 1566246 tombstoned cells in <table_name> (see tombstone_warn_threshold). 100 columns was requested, slices=[-], delInfo={deletedAt=-9223372036854775808, localDeletion=2147483647}
WARN [SharedPool-Worker-2] 2015-01-26 10:51:45,582 SliceQueryFilter.java:236 - Read 0 live and 1577906 tombstoned cells in <table_name> (see tombstone_warn_threshold). 100 columns was requested, slices=[-], delInfo={deletedAt=-9223372036854775808, localDeletion=2147483647}
WARN [SharedPool-Worker-2] 2015-01-26 10:51:46,443 SliceQueryFilter.java:236 - Read 0 live and 1571493 tombstoned cells in <table_name> (see tombstone_warn_threshold). 100 columns was requested, slices=[-], delInfo={deletedAt=-9223372036854775808, localDeletion=2147483647}
WARN [SharedPool-Worker-2] 2015-01-26 10:51:47,701 SliceQueryFilter.java:236 - Read 0 live and 1559448 tombstoned cells in <table_name> (see tombstone_warn_threshold). 100 columns was requested, slices=[-], delInfo={deletedAt=-9223372036854775808, localDeletion=2147483647}
WARN [SharedPool-Worker-2] 2015-01-26 10:51:49,255 SliceQueryFilter.java:236 - Read 0 live and 1574936 tombstoned cells in <table_name> (see tombstone_warn_threshold). 100 columns was requested, slices=[-], delInfo={deletedAt=-9223372036854775808, localDeletion=2147483647}
I've tried multiple compaction strategies, multithreaded compaction, I've tried running compaction manually with nodetool, also, I've tried forcing garbage collection with jmx.
One of my guesses is that the compaction doesn't delete tombstones files
Any ideas how to avoid disk space from getting too big, my biggest concern is running out of space, I'd rather store less (by making the ttl smaller but currently that doesn't help)
Tombstones will be preserved for 10 days using the default configuration. The reason for this is to make sure that offline nodes will be able to catch up with deletes when they join the cluster again. You can configure this value by setting the gc_grace_seconds setting.
I'm assuming you are using the timestamp as a clustering column within each partition when you say you are using the minute as the partition key, along with a TTL of 12 hours when you do the insert. This will build up tombstones in each partition since you are never deleting the entire row (i.e. a whole minute partition).
Suppose your keyspace is called k1 and your table is called t2, then you can run:
nodetool flush k1 t2
nodetool compact k1 t2
sstable2json /var/lib/cassandra/data/k1/t2/k1-t2-jb-<last version>-Data.db
then you'll see all the tombstones like this (marked with a "d" for deleted):
{"key": "00000003","columns": [["4:","54c7b514",1422374164512000,"d"], ["5:","54c7b518",1422374168501000,"d"], ["6:","54c7b51b",1422374171987000,"d"]]}
Now if you go and delete that row (i.e. delete from k1.t2 where key=3;), then do the flush, compact, and sstable2json again, you'll see it change to:
{"key": "00000003","metadata": {"deletionInfo": {"markedForDeleteAt":1422374340312000,"localDeletionTime":1422374340}},"columns": []}
So you see all the tombstones are gone and Cassandra only has to remember that the whole row was deleted at a certain time instead of little bits and pieces of the row being deleted at certain times.
Another way to get rid of the tombstones is to truncate the whole table. When you do that, Cassandra only needs to remember that the whole table was truncated at a certain time, and so no longer needs to keep tombstones prior to that time (since tombstones are used to tell other nodes that certain data was deleted, and if you can say the whole table was emptied at time x, then the details prior to that no longer matter).
So how could you apply this in your situation you ask. Well, you could use the hour and minute as your partition key, and then once an hour run a cron job to delete all the rows from 13 hours ago. Then on the next compaction, all the tombstones for that partition would be removed.
Or keep a separate table for each hour, and then truncate the table from 13 hours ago each hour using a cron job.
Another strategy that is sometimes useful is to "re-use" clustering keys. For example, if you were inserting data once per second, instead of having a high resolution timestamp as a clustering key, you could use the time modulo 60 seconds as the clustering key and keep the more unique timestamp as just a data field. So within each minute partition you would be changing tombstones (or outdated information) from yesterday back into live rows today, and then you wouldn't accumulate tombstones over many days.
So hopefully that gives you some ideas for things to try. Usually when you run into a tombstone problem, it's a sign that you need to re-think your schema a little bit.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With