Currently, I have a cassandra column family with large rows of data, to say more than 100,000. Now, I'd like to remove all data in this column family and the problem came up:
After all data is removed, I execute a lookup query in this column family, the cassandra will take tens of seconds to return a empty query result. And the time cost will increase Linearly when the original data is larger
It is caused by the tombstone feature while deleting data from the cassandra database. The lookup speed won't recover to normal until the next GC is fired. See Cassandra Distributed Deletes.
Because such query operations are frequently used in my system, I cannot bear the huge latency up to a few seconds.
Would you please give me a solution to this problem?
This sounds like a very bad way to use a database. Populate it, empty it, repeat. One way you can solve your problem is by using different CF names each time, as in when you empty the data and start repopulating it, create a new column family and use that and just drop the other colum family however this is hacky.
I'd suggest using compaction (gets rid of all the tombstones it can detect) to solve your problem, it is CPU intensive but it's better than waiting for tens of seconds for queries to respond. You can make the task less intensive on your machine by providing the specific ks & cf you want to compact:
./nodetool compact <ks_name> <cf_name>
Ritchard's point is a good one, gc_grace_seconds is set to 10 days by default so you will probably have to tweak this to allow for compaction to get rid of tombstones.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With