How to do data migration in Cassandra?

Tags:

cassandra

We have a common requirement (data migration) to batch-modify data such as a user id column (change user id 001 to 002, change user id 003 to 004). But the user id field in table1 isn't a primary key (we can't get all the rows to update except with select * from table), while in table2 it is the primary key (that case we can handle). So we have no way to select all the affected data with a WHERE clause for every table.

So how can we meet this requirement?

I have thought of two methods:

(1) select * from table with a fetch size set, then update each row. // Is this the right way?
(2) Use the COPY command to export to a CSV file, then modify it and import it again. // Is the performance too slow?
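
For example, for the table where userid is not part of the primary key (table2 in the schema below), method (1) might look roughly like the following sketch with the DataStax Python driver; the keyspace name "ks" and the id mapping are only placeholders:

    # Sketch of method (1): page through table2 with a fetch size and update the
    # rows whose userid has to change. Keyspace "ks" and ID_MAP are placeholders.
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    ID_MAP = {"001": "002", "003": "004"}     # old userid -> new userid

    cluster = Cluster(["127.0.0.1"])
    session = cluster.connect("ks")

    # fetch_size makes the driver page through the full scan instead of
    # loading every row into memory at once.
    scan = SimpleStatement("SELECT phonenumber, userid FROM table2", fetch_size=1000)
    update = session.prepare("UPDATE table2 SET userid = ? WHERE phonenumber = ?")

    for row in session.execute(scan):         # pages are fetched lazily while iterating
        new_id = ID_MAP.get(row.userid)
        if new_id is not None:
            session.execute(update, (new_id, row.phonenumber))

    cluster.shutdown()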

Can these methods be used in production (with more than a million records), or is there a better standard method for this requirement? sstableloader? Pig?

Maybe modifying one column across all existing tables is a common requirement, so perhaps a standard solution already exists.

No matter which method we choose in the end, how do we handle the new data that is written while the old data is being migrated? In other words, how do we deal with the incremental data that keeps arriving during the migration?

Looking forward to your replies.

table1: userid (pk), name, sex

table2: phonenumber (pk), userid

asked Dec 01 '15 by jiafu




3 Answers

I'm not completely clear on what you're trying to do, but you might want to look at using the spark-cassandra connector to have Spark do these transformations.

With the connector you can read entire tables into Spark RDDs, do joins and transformations on fields in those RDDs, and then save the resulting RDDs back to Cassandra. So for what you describe, you'd roughly do the following steps:

  1. Read table1 and table2 into RDD1 and RDD2
  2. Possibly do a join on userid between RDD1 and RDD2 to create RDD3
  3. Transform the userid field and whatever else you want to change
  4. Create tables in Cassandra with whatever you want to be the primary key
  5. Save your transformed RDDs to the new tables in Cassandra
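
As an illustration, a rough PySpark sketch of those steps (using the connector's DataFrame API rather than raw RDDs) might look like the following. The keyspace name "ks", the target table names table1_v2/table2_v2, and the id mapping are assumptions, and the spark-cassandra-connector package needs to be on the Spark classpath:

    # Sketch of the steps above with PySpark and the spark-cassandra-connector
    # DataFrame API. Keyspace "ks", the target tables, and ID_MAP are assumptions.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (SparkSession.builder
             .appName("userid-migration")
             .config("spark.cassandra.connection.host", "127.0.0.1")
             .getOrCreate())

    def read_table(name):
        return (spark.read
                .format("org.apache.spark.sql.cassandra")
                .options(keyspace="ks", table=name)
                .load())

    table1 = read_table("table1")   # userid (pk), name, sex
    table2 = read_table("table2")   # phonenumber (pk), userid

    # Hypothetical old -> new id mapping; in practice this could come from a
    # mapping table or a deterministic function.
    id_map = spark.createDataFrame([("001", "002"), ("003", "004")],
                                   ["old_id", "new_id"])

    def remap(df):
        # Replace userid where a mapping exists, keep it unchanged otherwise.
        return (df.join(id_map, df.userid == id_map.old_id, "left")
                  .withColumn("userid", F.coalesce("new_id", "userid"))
                  .drop("old_id", "new_id"))

    # Write the transformed rows into new tables created beforehand with the
    # desired primary keys (table1_v2 / table2_v2 are assumed names).
    for df, target in [(remap(table1), "table1_v2"), (remap(table2), "table2_v2")]:
        (df.write
           .format("org.apache.spark.sql.cassandra")
           .options(keyspace="ks", table=target)
           .mode("append")
           .save())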

This approach would scale well to millions of records since Spark is designed to work on the data in chunks if there is not enough system memory to hold everything in memory at the same time. And Spark will be able to do a lot of work in parallel on all the nodes at once as opposed to you writing a CQL client to fetch all the records and do all this work on a single client machine.

The hard part would be adding Spark to your Cassandra cluster and learning how to write Spark jobs, but if this is the sort of thing you will be doing often, it might be worth the trouble.

answered Nov 02 '22 by Jim Meyer


Depending on the volume of data, you likely have 3 options:

1) COPY TO in CQLSH, which will use paging and create a CSV file. You can then parse that CSV using the programming language of your choice, create a new CSV with the updated IDs, truncate the table (or create a new table), and COPY FROM it back into the system (a small CSV-rewrite sketch follows after this list). This will work for a few million entries; I probably wouldn't try it for a few billion. COPY FROM will not require knowing all of the keys in advance.

2) Use Spark. Jim Meyer did a reasonable job explaining Spark. Spark will scale better than the COPY commands in CQLSH, but requires additional setup.

3) Use CQLSSTableWriter, sstableloader and streaming. Read the rows using a driver with paging (such as the DataStax Java driver). Use CQLSSTableWriter to transform that data and write new sstables offline. Drop or truncate the old table, and use sstableloader to stream the new sstables into the cluster. This is suitable for terabytes of data, and can be parallelized if you plan ahead. Yuki Morishita does a good job documenting this approach on the DataStax blog. You won't necessarily need to know all of the keys; you can SELECT DISTINCT to get each partition key, or use COPY TO to generate a CSV file.
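
To illustrate option 1, the CSV-rewrite step in the middle could be a short script like this (the column order, file names, and id mapping are assumptions; the surrounding cqlsh COPY commands are shown as comments):

    # Illustration of option 1: rewrite the CSV produced by COPY TO.
    # Assumed flow (run in cqlsh):
    #   COPY ks.table2 (phonenumber, userid) TO 'table2.csv';
    #   ... run this script ...
    #   TRUNCATE ks.table2;   -- or create a new table
    #   COPY ks.table2 (phonenumber, userid) FROM 'table2_new.csv';
    # The column order and the old -> new id mapping are assumptions.
    import csv

    ID_MAP = {"001": "002", "003": "004"}

    with open("table2.csv", newline="") as src, \
         open("table2_new.csv", "w", newline="") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        for phonenumber, userid in reader:
            writer.writerow([phonenumber, ID_MAP.get(userid, userid)])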

answered Nov 03 '22 by Jeff Jirsa


This smells like an anti-pattern.

Primary keys should be stable

Primary keys (especially the partition key) shouldn't be changing, especially globally across the dataset.

When the partition key changes, a row gets a new token, so it has to be moved from its current replica nodes to new replica nodes.

When any part of the primary key changes, the rows need to be resorted.

Changing the primary key is an expensive operation. And as you are discovering, updating all of the references in other tables is also expensive.

If the field you've chosen as your primary key is not stable, then you should consider using a different, more stable, field as your primary key. Worst case, use a synthetic key (uuid or timeuuid).

I strongly suggest you revisit your data model and adjust it to support your "data migration" needs in a way that doesn't require modifying the primary key.

If you provide more details about your migration requirement then we might be able to suggest a better way to model it.

answered Nov 03 '22 by Brandon