How should I copy a keyspace within a cluster

Tags:

cassandra

I have a keyspace populated with data that was expensive to generate. I want two copies of this data within my cluster. I would like to end up with two keyspaces, let's call them mydata and mydatabackup, both of which contain identical data (I don't mind if the Cassandra timestamps are different).

Is there an easy way to do this? The closest thing I can find to an answer is to use sstable2json and json2sstable, as suggested in response to a similar question. Is there a better way?

asked Sep 13 '13 16:09

lorcan

1 Answer

"Is there a better way?"

Cassandra stores all of its data in the data/ folder (check the data_file_directories setting in cassandra.yaml). You may also want to check the saved_caches_directory and commitlog_directory settings.

Inside the data folder, you'll have:

One folder per keyspace
One folder for the system keyspace
Some folders for authentication, etc.

Inside each keyspace folder, you'll have:

*-Data.db files, which contain your actual data
*-Filter.db files (Bloom filters)
*-Index.db files for the index
...
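
To see which of these files a keyspace folder actually contains, something like the following might help. It is only a minimal sketch: the function name group_sstable_components is made up for illustration, and it assumes the component name is whatever follows the last "-" in each file name, as in the patterns listed above.

```python
from collections import defaultdict
from pathlib import Path


def group_sstable_components(keyspace_dir):
    """Group the *.db files under keyspace_dir by component suffix.

    Hypothetical helper: it assumes the component (Data.db, Index.db,
    Filter.db, ...) is the part of the file name after the last '-'.
    """
    groups = defaultdict(list)
    # rglob also descends into per-table subfolders, if there are any.
    for path in Path(keyspace_dir).rglob("*.db"):
        component = path.name.rsplit("-", 1)[-1]
        groups[component].append(path)
    return dict(groups)
```

Calling group_sstable_components on a keyspace folder returns a dict keyed by "Data.db", "Index.db" and so on, which makes it easy to check that nothing is missing after a copy.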

To replicate data, you do a plain copy of those folders.
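
A plain copy like this could be scripted roughly as follows. This is a sketch under assumptions, not a tested backup procedure: copy_keyspace_dir is a hypothetical helper, the paths are whatever your data_file_directories setting points to, and since each node only holds its own files it would have to run on every node.

```python
import shutil
from pathlib import Path


def copy_keyspace_dir(data_dir, keyspace, dest_root):
    """Copy <data_dir>/<keyspace> to <dest_root>/<keyspace>.

    Hypothetical helper for illustration; data_dir is assumed to be one of
    the directories listed under data_file_directories in cassandra.yaml.
    """
    src = Path(data_dir) / keyspace
    if not src.is_dir():
        raise FileNotFoundError(f"no keyspace folder at {src}")
    dest = Path(dest_root) / keyspace
    # copytree refuses to overwrite an existing destination, which guards
    # against silently clobbering an earlier backup.
    shutil.copytree(src, dest)
    return dest
```

For example, copy_keyspace_dir("/var/lib/cassandra/data", "mydata", "/backups/today") would copy the mydata folder, assuming that is where your data lives.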

On our team, ops use a crontab to schedule regular backups of Cassandra data this way.

Note: you may miss live data that is still in a memtable in memory and has not been flushed to disk yet. You can trigger a full compaction before backing up the data files, but a full compaction may hurt your performance, so be careful. nodetool flush writes the memtables to disk without compacting.

Better answer: use the provided tool to take a snapshot of your DB:

http://www.datastax.com/docs/1.0/operations/backup_restore
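
The tool in question is nodetool snapshot. A rough way to drive it from a script is sketched below; the helper names and the tag value are made up for illustration, it assumes nodetool is on the PATH of the node it runs on, and the exact options may differ between Cassandra versions, so check the documentation for yours.

```python
import subprocess


def snapshot_command(keyspace, tag):
    """Build the argument list for `nodetool snapshot -t <tag> <keyspace>`."""
    if not keyspace or not tag:
        raise ValueError("keyspace and tag must be non-empty")
    return ["nodetool", "snapshot", "-t", tag, keyspace]


def take_snapshot(keyspace, tag):
    """Run the snapshot on the local node; raises CalledProcessError on failure."""
    # Assumption: nodetool is installed and on the PATH of this node.
    subprocess.run(snapshot_command(keyspace, tag), check=True)
```

Like the plain copy, a snapshot only covers the node it is taken on, so take_snapshot("mydata", "mydatabackup") would need to run on each node in the cluster.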

answered Oct 21 '22 03:10

doanduyhai
