We have a common requirement (data migration): batch-modifying the values of one column, for example a user id column (change user id 001 to 002, change user id 003 to 004). The user id field is the primary key in table1 (that case we can handle), but in table2 it is only a regular column, so the only way to reach all of the affected rows there is select * from table; we cannot select them with a WHERE clause in every table.
So how can we meet this requirement?
So far I can only think of two methods:
(1) select * from table with a fetch size set, then update each matching row. Is that the right way? (A rough sketch of this is shown after the table definitions below.)
(2) Use the COPY command to export to a CSV file, modify it, and import it again. Would the performance be too slow?
Can these methods be used in production (with more than a million records), or is there a better standard method for this requirement? sstableloader? Pig?
Modifying one column across all existing tables is probably a common requirement, so perhaps a standard solution already exists.
Whichever method we choose, how do we handle the data that keeps arriving while the old data is being migrated? In other words, how do we migrate the incremental data?
Looking forward to your reply.
table1: userid (PK), name, sex
table2: phonenumber (PK), userid
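For method (1), roughly what I have in mind with the DataStax Java driver (3.x) is the sketch below; the keyspace ks, contact point, fetch size, and the old-to-new id mapping are just placeholders:

```java
import com.datastax.driver.core.*;

import java.util.HashMap;
import java.util.Map;

public class UserIdMigration {
    public static void main(String[] args) {
        // Old id -> new id mapping (placeholder values).
        Map<String, String> idMap = new HashMap<>();
        idMap.put("001", "002");
        idMap.put("003", "004");

        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("ks")) {

            // Full scan of table2 with paging, so all rows are never held in memory at once.
            Statement scan = new SimpleStatement("SELECT phonenumber, userid FROM table2")
                    .setFetchSize(1000);
            PreparedStatement update = session.prepare(
                    "UPDATE table2 SET userid = ? WHERE phonenumber = ?");

            for (Row row : session.execute(scan)) {
                String oldId = row.getString("userid");
                String newId = idMap.get(oldId);
                if (newId != null) {
                    session.execute(update.bind(newId, row.getString("phonenumber")));
                }
            }
        }
    }
}
```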
You can use Change Data Capture (CDC) to copy the data from Cassandra to PostgreSQL as and when there is a change in the Cassandra data. One option is to use Kafka Connect with appropriate connectors.
In Cassandra, data distribution and replication go together. Data is organized by table and identified by a primary key, which determines which node the data is stored on. Replicas are copies of rows. When data is first written, it is also referred to as a replica.
I'm not completely clear on what you're trying to do, but you might want to look at the spark-cassandra-connector and use Spark to do these transformations.
With the connector you can read entire tables into Spark RDDs, do joins and transformations on fields in those RDDs, and then save the resulting RDDs back to Cassandra. So for what you describe, you would roughly read the affected tables into RDDs, rewrite the userid values (re-keying the rows where userid is part of the primary key), and save the results back to Cassandra.
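To make that concrete, here is a rough Java sketch using the connector's Java API; the keyspace ks, contact point, and id mapping are assumptions, and it only covers table2, where userid is a regular column. For a table where userid is part of the primary key you would instead write new rows and delete the old ones, since a key cannot be updated in place:

```java
import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;
import static com.datastax.spark.connector.japi.CassandraJavaUtil.mapRowTo;
import static com.datastax.spark.connector.japi.CassandraJavaUtil.mapToRow;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

public class SparkUserIdMigration {

    // Simple bean mapped to table2 (phonenumber PK, userid regular column).
    public static class Phone implements Serializable {
        private String phonenumber;
        private String userid;
        public Phone() {}
        public String getPhonenumber() { return phonenumber; }
        public void setPhonenumber(String phonenumber) { this.phonenumber = phonenumber; }
        public String getUserid() { return userid; }
        public void setUserid(String userid) { this.userid = userid; }
    }

    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("userid-migration")
                .set("spark.cassandra.connection.host", "127.0.0.1");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Old id -> new id mapping (placeholder values).
        Map<String, String> idMap = new HashMap<>();
        idMap.put("001", "002");
        idMap.put("003", "004");

        // Read table2 into an RDD, rewrite the userid values, save the rows back.
        JavaRDD<Phone> rows = javaFunctions(sc)
                .cassandraTable("ks", "table2", mapRowTo(Phone.class));

        JavaRDD<Phone> rewritten = rows.map(p -> {
            String newId = idMap.get(p.getUserid());
            if (newId != null) {
                p.setUserid(newId);
            }
            return p;
        });

        javaFunctions(rewritten)
                .writerBuilder("ks", "table2", mapToRow(Phone.class))
                .saveToCassandra();

        sc.stop();
    }
}
```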
This approach would scale well to millions of records, since Spark is designed to process the data in chunks when there is not enough memory to hold everything at once. And Spark will be able to do a lot of work in parallel on all the nodes at once, as opposed to writing a CQL client that fetches all the records and does all this work on a single client machine.
The hard part would be adding Spark to your Cassandra cluster and learning how to write Spark jobs, but if this is the sort of thing you will be doing often, it might be worth the trouble.
Depending on the volume of data, you likely have 3 options:
1) COPY TO in cqlsh, which will use paging and create a CSV file. You can then parse that CSV using the programming language of your choice, create a new CSV with the updated IDs, truncate the table (or create a new table), and COPY FROM it back into the system. This will work for a few million entries; I probably wouldn't try it for a few billion. COPY FROM will not require knowing all of the keys in advance. (A rough sketch of the CSV rewrite step follows this list.)
2) Use Spark. Jim Meyer did a reasonable job explaining Spark. Spark will scale better than COPY commands in cqlsh, but requires additional setup.
3) Use CQLSSTableWriter, sstableloader, and streaming. Read the rows using a driver with paging (such as the DataStax Java driver). Use CQLSSTableWriter to transform that data and write new sstables offline. Drop or truncate the old table, and use sstableloader to feed the new sstables into the cluster. This is suitable for terabytes of data, and can be parallelized if you plan ahead. Yuki Morishita does a good job documenting this approach on the DataStax blog. You won't necessarily need to know all of the keys in advance; you can SELECT DISTINCT to get each partition key, or use COPY TO to generate a CSV file. (A rough CQLSSTableWriter sketch also follows this list.)
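For option 1, the CSV rewrite step could be as simple as the following sketch; the file names and id mapping are placeholders, and it naively splits on commas, so it assumes the exported values contain no quoted or embedded commas:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.PrintWriter;
import java.util.HashMap;
import java.util.Map;

public class CsvUserIdRewrite {
    public static void main(String[] args) throws Exception {
        // Old id -> new id mapping (placeholder values).
        Map<String, String> idMap = new HashMap<>();
        idMap.put("001", "002");
        idMap.put("003", "004");

        // Export created with: COPY ks.table2 (phonenumber, userid) TO 'table2.csv'
        try (BufferedReader in = new BufferedReader(new FileReader("table2.csv"));
             PrintWriter out = new PrintWriter(new FileWriter("table2_new.csv"))) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] cols = line.split(",", -1);   // [phonenumber, userid]
                cols[1] = idMap.getOrDefault(cols[1], cols[1]);
                out.println(cols[0] + "," + cols[1]);
            }
        }
        // Then: TRUNCATE ks.table2; COPY ks.table2 (phonenumber, userid) FROM 'table2_new.csv'
    }
}
```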
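For option 3, writing the transformed rows into offline sstables with CQLSSTableWriter could look roughly like this (output directory, schema, input rows, and id mapping are all assumptions); the resulting directory is then streamed in with something like `sstableloader -d <host> /tmp/ks/table2`:

```java
import org.apache.cassandra.io.sstable.CQLSSTableWriter;

import java.util.HashMap;
import java.util.Map;

public class OfflineSSTableRewrite {
    public static void main(String[] args) throws Exception {
        // Old id -> new id mapping (placeholder values).
        Map<String, String> idMap = new HashMap<>();
        idMap.put("001", "002");
        idMap.put("003", "004");

        String schema = "CREATE TABLE ks.table2 (phonenumber text PRIMARY KEY, userid text)";
        String insert = "INSERT INTO ks.table2 (phonenumber, userid) VALUES (?, ?)";

        CQLSSTableWriter writer = CQLSSTableWriter.builder()
                .inDirectory("/tmp/ks/table2")   // must already exist; named <keyspace>/<table>
                .forTable(schema)
                .using(insert)
                .build();

        // In a real job the rows would come from a paged SELECT (or the exported CSV);
        // here two hard-coded rows stand in for that input.
        writer.addRow("555-0001", idMap.getOrDefault("001", "001"));
        writer.addRow("555-0002", idMap.getOrDefault("003", "003"));
        writer.close();
    }
}
```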
This smells like an anti-pattern.
Primary keys (especially the partition key) shouldn't be changing, especially globally across the dataset.
When a partition key changes, the rows will get a new token and the rows will have to be moved from their current replica nodes to new replica nodes.
When any part of the primary key changes, the rows need to be resorted.
Changing the primary key is an expensive operation. And as you are discovering, updating all of the references in other tables is also expensive.
If the field you've chosen as your primary key is not stable, then you should consider using a different, more stable, field as your primary key. Worst case, use a synthetic key (uuid or timeuuid).
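To illustrate the synthetic-key suggestion, here is a small sketch of how table1 could be remodeled (keyspace and column types are assumptions): the uuid is generated once and never changes, while userid becomes a regular column, so a later "change user id" migration turns into a cheap in-place UPDATE instead of a primary-key rewrite.

```java
import java.util.UUID;

public class SyntheticKeySketch {
    public static void main(String[] args) {
        // Hypothetical remodeled schema: the synthetic id is the stable primary key.
        String ddl = "CREATE TABLE ks.table1 ("
                + "id uuid PRIMARY KEY, "   // never changes once written
                + "userid text, "           // mutable business identifier
                + "name text, sex text)";

        UUID id = UUID.randomUUID();        // generated once, at insert time

        // Renaming user 001 to 002 later only touches a regular column:
        String update = "UPDATE ks.table1 SET userid = '002' WHERE id = " + id;

        System.out.println(ddl);
        System.out.println(update);
    }
}
```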
I strongly suggest you revisit your data model and adjust it to support your "data migration" needs in a way that doesn't require modifying the primary key.
If you provide more details about your migration requirement then we might be able to suggest a better way to model it.