How to archive and purge Cassandra data

I have a Cassandra cluster with multiple data centres. I want to archive data monthly and then purge it. There are numerous articles on backing up and restoring, but none that explain how to archive data in a Cassandra cluster.

Can someone please let me know how I can archive the data in my Cassandra cluster monthly and then purge it?

Nipun asked Sep 07 '15 10:09


2 Answers

I don't think there is a ready-made tool for archiving Cassandra data. You have to write either a Spark job or a MapReduce job that uses CqlInputFormat to archive the data. The links below show how people are archiving data in Cassandra:

[1] - http://docs.wso2.org/display/BAM240/Archive+Cassandra+Data

[2] - http://docs.wso2.org/pages/viewpage.action?pageId=32345660

[3] - http://accelconf.web.cern.ch/AccelConf/ICALEPCS2013/papers/tuppc004.pdf

You can also turn on incremental backups in Cassandra, which can be used like CDC (change data capture).
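As a rough sketch of the incremental-backup route: setting `incremental_backups: true` in `cassandra.yaml` makes a node hard-link each flushed SSTable into a `backups/` directory under the table's data directory, and Cassandra does not clean those files up itself. A periodic job on every node could move them to an archive location. The function name `collect_incremental_backups`, the archive layout, and the paths in the usage note are assumptions for illustration only.

```python
import shutil
from pathlib import Path


def collect_incremental_backups(data_dir, archive_dir):
    """Move files from every <keyspace>/<table>/backups/ dir into archive_dir.

    Returns the destination paths, which mirror the keyspace/table layout.
    """
    data_dir, archive_dir = Path(data_dir), Path(archive_dir)
    moved = []
    # Assumed layout: <data_dir>/<keyspace>/<table-id>/backups/<sstable files>
    for backup_dir in sorted(data_dir.glob("*/*/backups")):
        if not backup_dir.is_dir():
            continue
        keyspace, table = backup_dir.parts[-3], backup_dir.parts[-2]
        target = archive_dir / keyspace / table
        target.mkdir(parents=True, exist_ok=True)
        # sorted() materialises the listing before any file is moved
        for f in sorted(backup_dir.iterdir()):
            if f.is_file():
                dest = target / f.name
                shutil.move(str(f), str(dest))
                moved.append(dest)
    return moved
```

On a node this might be called as `collect_incremental_backups("/var/lib/cassandra/data", "/mnt/archive/2015-09")`, where both paths are placeholders. Live SSTables outside `backups/` are left untouched.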

Sachin Janani answered Sep 28 '22 10:09


It is best practice to use TimeWindowCompactionStrategy (TWCS) with a monthly window on your tables, together with a TTL of one month, so that data older than a month is purged automatically.
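A minimal sketch of what that table setting could look like, written as a small Python helper that builds the CQL statement. The keyspace and table names are hypothetical. TWCS windows are expressed in minutes, hours, or days, so the monthly window is approximated here as 30 days, and `default_time_to_live` is given in seconds. The default TTL only applies to data written after the change.

```python
def twcs_ttl_statement(keyspace, table, window_days=30, ttl_days=30):
    """Build an ALTER TABLE statement that sets TWCS and a default TTL."""
    ttl_seconds = ttl_days * 24 * 60 * 60  # default_time_to_live is in seconds
    return (
        f"ALTER TABLE {keyspace}.{table} "
        "WITH compaction = {"
        "'class': 'TimeWindowCompactionStrategy', "
        "'compaction_window_unit': 'DAYS', "
        f"'compaction_window_size': '{window_days}'"
        "} "
        f"AND default_time_to_live = {ttl_seconds};"
    )


# Hypothetical keyspace/table names, for illustration only
print(twcs_ttl_statement("metrics", "events"))
```

The resulting statement can be run in cqlsh or through any driver. The window size is a tuning choice, so treat the 30-day value as a starting point rather than a recommendation.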

If you write a purge job that performs the deletions itself (on tables that do not have a suitable compaction strategy), it can hurt cluster performance, because searching for data by date or month will overwhelm the cluster.

I have experienced this: we ultimately had to go back, change the structure of the tables, and alter the compaction strategy. That is why getting the table design right in the first place is so important. From the very beginning, we need to think not only about how data will be inserted into and read from the tables, but also about how it will be deleted, and then choose the keys, compaction, TTL, etc. accordingly.

For archiving, just write a few lines of code to read the data from Cassandra and write it to your archival location.

Let me know if this helps you get the result you want, or if you have further questions that I can help with.
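Those few lines might look something like the sketch below. It assumes the DataStax Python driver (`cassandra-driver`) and a hypothetical `events` table whose partition key includes a `month_bucket` column such as '2015-08', so that one month can be read without scanning the whole cluster. The table, column names, and output file are all illustrative.

```python
import csv
import gzip


def archive_rows(rows, columns, out_path):
    """Write an iterable of row tuples to a gzip-compressed CSV; return the row count."""
    count = 0
    with gzip.open(out_path, "wt", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(columns)  # header line
        for row in rows:
            writer.writerow(row)
            count += 1
    return count


def fetch_month(session, month_bucket):
    """Stream one month of rows from the hypothetical `events` table."""
    # Imported lazily so archive_rows works without the driver installed
    from cassandra.query import SimpleStatement

    stmt = SimpleStatement(
        "SELECT id, created_at, payload FROM events WHERE month_bucket = %s",
        fetch_size=1000,  # page through results instead of loading them all
    )
    return session.execute(stmt, (month_bucket,))
```

With a connected driver session, a monthly run could be `archive_rows(fetch_month(session, "2015-08"), ["id", "created_at", "payload"], "events-2015-08.csv.gz")`. Comparing the returned row count with the source before the data expires is a cheap sanity check.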

Chandan Goel answered Sep 28 '22 10:09