
Best practices for cleaning up Cassandra incremental backup folders

We have incremental backups enabled on our Cassandra cluster. The "backups" folders under the data folders now contain a lot of data, and some of them hold millions of files.

According to the documentation: "DataStax recommends setting up a process to clear incremental backup hard-links each time a new snapshot is created."

It's not clear to me what the best way is to clear out these files. Can they all just be deleted when a snapshot is created, or should we delete files that are older than a certain period?

My thought was, just to be on the safe side, to run a regular script to delete files more than 30 days old:

find [Cassandra data root]/*/*/backups -type f -mtime +30 -delete

Am I being too careful? We're not concerned about having a long backup history.

Thanks.

John Douglass, asked Jan 06 '15 17:01

1 Answer

You are probably being too careful, though that's not always a bad thing, and there are a number of considerations. A good pattern is to keep multiple snapshots (for example, weekly snapshots going back some period) plus all incremental backups taken during that period, so you can restore to known states. For example, if your most recent snapshot turns out not to work for whatever reason, you can still fall back on your previous snapshot plus all SSTables backed up since then.

You can delete all existing backups once your snapshot has been taken, because taking a snapshot flushes the memtables and hard links all current SSTables into a snapshots directory. Just make sure your snapshots are actually happening and completing (it's a pretty solid process since it only hard links) before getting rid of old snapshots and deleting backups.
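As a rough sketch of that cleanup, assuming the usual [data root]/<keyspace>/<table>/backups and .../snapshots/<tag> layout: the two helper functions below (snapshot_present and prune_backups, both names made up for illustration) delete only backup files that are not newer than a marker file created just before the snapshot started. The marker file, the tag name and the data root in the usage comment are assumptions, not anything Cassandra provides. Finding a snapshot directory does not prove the snapshot is complete, so treat the check as a basic guard only.

```shell
#!/bin/sh
# Sketch only: function names, marker file and tag are illustrative assumptions.

# snapshot_present DATA_ROOT TAG
# Succeeds if at least one <keyspace>/<table>/snapshots/<TAG> directory exists.
snapshot_present() {
    data_root=$1
    tag=$2
    for dir in "$data_root"/*/*/snapshots/"$tag"; do
        [ -d "$dir" ] && return 0
    done
    return 1
}

# prune_backups DATA_ROOT MARKER
# Deletes files under <keyspace>/<table>/backups that are not newer than MARKER.
# MARKER should be a file created just BEFORE the snapshot was started, so that
# anything flushed during or after the snapshot is left alone.
prune_backups() {
    data_root=$1
    marker=$2
    # Refuse to delete anything if the marker is missing.
    [ -f "$marker" ] || { echo "marker not found: $marker" >&2; return 1; }
    for dir in "$data_root"/*/*/backups; do
        [ -d "$dir" ] || continue
        find "$dir" -type f ! -newer "$marker" -delete
    done
}

# Example usage (not executed here; data root and tag are assumptions):
#   marker=$(mktemp)
#   nodetool snapshot -t weekly \
#     && snapshot_present /var/lib/cassandra/data weekly \
#     && prune_backups /var/lib/cassandra/data "$marker"
```

Compared with a fixed 30-day cutoff, tying the deletion to a snapshot means the backups folder only ever holds files flushed since the last snapshot, and nothing is removed if the snapshot step fails.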

You should also test your restore process, as that will give you a good idea of what you need to keep. You should be able to restore from your last snapshot plus the SSTables backed up since that time. It would be a good idea to fire up a new cluster and try restoring data from your snapshots and backups, or to try the process in place in a test environment.

I like to point to the article 'Cassandra and Backups' as a good rundown of backing up and restoring Cassandra.

Andy Tolbert, answered Oct 06 '22 22:10