How to do data migration in Cassandra?

Tags:

cassandra

We have a common requirement (data migration) to batch-modify data such as a user id column (change user id 001 to 002, change user id 003 to 004). But the user id field in table1 isn't a primary key (we can't get all the rows to update except with select * from table), while in table2 it is the primary key (that case we can handle). So we have no way to select all the affected data with a WHERE clause for every table.

So how can we meet this requirement?

I have thought of two methods:

(1) select * from table with a fetch size set, then update each row. // Is this the right way?
(2) Use the COPY command to export to a CSV file, then modify it and import it again. // Is the performance too slow?
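
For example, for the table where userid is not part of the primary key (table2 in the schema below), method (1) might look roughly like the following sketch with the DataStax Python driver; the keyspace name "ks" and the id mapping are only placeholders:

    # Sketch of method (1): page through table2 with a fetch size and update the
    # rows whose userid has to change. Keyspace "ks" and ID_MAP are placeholders.
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    ID_MAP = {"001": "002", "003": "004"}     # old userid -> new userid

    cluster = Cluster(["127.0.0.1"])
    session = cluster.connect("ks")

    # fetch_size makes the driver page through the full scan instead of
    # loading every row into memory at once.
    scan = SimpleStatement("SELECT phonenumber, userid FROM table2", fetch_size=1000)
    update = session.prepare("UPDATE table2 SET userid = ? WHERE phonenumber = ?")

    for row in session.execute(scan):         # pages are fetched lazily while iterating
        new_id = ID_MAP.get(row.userid)
        if new_id is not None:
            session.execute(update, (new_id, row.phonenumber))

    cluster.shutdown()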

Can these methods be used in production (with more than a million records), or is there a better standard method for this requirement? sstableloader? Pig?

Maybe modifying one column across all existing tables is a common requirement, so perhaps a standard solution already exists.

No matter which method we choose in the end, how do we handle the new data that is written while the old data is being migrated? In other words, how do we deal with the incremental data that keeps arriving during the migration?

Looking forward to your replies.

table1: userid (pk), name, sex

table2: phonenumber (pk), userid

asked Dec 01 '15 by jiafu




3 Answers

I'm not completely clear on what you're trying to do, but you might want to look at using the spark-cassandra connector to have Spark do these transformations.

With the connector you can read entire tables into Spark RDDs, do joins and transformations on fields in those RDDs, and then save the resulting RDDs back to Cassandra. So for what you describe, you'd roughly do the following steps:

  1. Read table1 and table2 into RDD1 and RDD2
  2. Possibly do a join on userid between RDD1 and RDD2 to create RDD3
  3. Transform the userid field and whatever else you want to change
  4. Create tables in Cassandra with whatever you want to be the primary key
  5. Save your transformed RDDs to the new tables in Cassandra
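
As an illustration, a rough PySpark sketch of those steps (using the connector's DataFrame API rather than raw RDDs) might look like the following. The keyspace name "ks", the target table names table1_v2/table2_v2, and the id mapping are assumptions, and the spark-cassandra-connector package needs to be on the Spark classpath:

    # Sketch of the steps above with PySpark and the spark-cassandra-connector
    # DataFrame API. Keyspace "ks", the target tables, and ID_MAP are assumptions.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (SparkSession.builder
             .appName("userid-migration")
             .config("spark.cassandra.connection.host", "127.0.0.1")
             .getOrCreate())

    def read_table(name):
        return (spark.read
                .format("org.apache.spark.sql.cassandra")
                .options(keyspace="ks", table=name)
                .load())

    table1 = read_table("table1")   # userid (pk), name, sex
    table2 = read_table("table2")   # phonenumber (pk), userid

    # Hypothetical old -> new id mapping; in practice this could come from a
    # mapping table or a deterministic function.
    id_map = spark.createDataFrame([("001", "002"), ("003", "004")],
                                   ["old_id", "new_id"])

    def remap(df):
        # Replace userid where a mapping exists, keep it unchanged otherwise.
        return (df.join(id_map, df.userid == id_map.old_id, "left")
                  .withColumn("userid", F.coalesce("new_id", "userid"))
                  .drop("old_id", "new_id"))

    # Write the transformed rows into new tables created beforehand with the
    # desired primary keys (table1_v2 / table2_v2 are assumed names).
    for df, target in [(remap(table1), "table1_v2"), (remap(table2), "table2_v2")]:
        (df.write
           .format("org.apache.spark.sql.cassandra")
           .options(keyspace="ks", table=target)
           .mode("append")
           .save())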

This approach would scale well to millions of records since Spark is designed to work on the data in chunks if there is not enough system memory to hold everything in memory at the same time. And Spark will be able to do a lot of work in parallel on all the nodes at once as opposed to you writing a CQL client to fetch all the records and do all this work on a single client machine.

The hard part would be adding Spark to your Cassandra cluster and learning how to write Spark jobs, but if this is the sort of thing you will be doing often, it might be worth the trouble.

answered Nov 02 '22 by Jim Meyer


Depending on the volume of data, you likely have 3 options:

1) COPY TO in CQLSH, which will use paging and create a CSV file. You can then parse that CSV using the programming language of your choice, create a new CSV with the updated IDs, truncate the table (or create a new table), and COPY FROM it back into the system (a small CSV-rewrite sketch follows after this list). This will work for a few million entries; I probably wouldn't try it for a few billion. COPY FROM will not require knowing all of the keys in advance.

2) Use Spark. Jim Meyer did a reasonable job explaining Spark. Spark will scale better than the COPY commands in CQLSH, but requires additional setup.

3) Use CQLSSTableWriter, sstableloader and streaming. Read the rows using a driver with paging (such as the DataStax Java driver). Use CQLSSTableWriter to transform that data and write new sstables offline. Drop or truncate the old table, and use sstableloader to stream the new sstables into the cluster. This is suitable for terabytes of data, and can be parallelized if you plan ahead. Yuki Morishita does a good job documenting this approach on the DataStax blog. You won't necessarily need to know all of the keys; you can SELECT DISTINCT to get each partition key, or use COPY TO to generate a CSV file.
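
To illustrate option 1, the CSV-rewrite step in the middle could be a short script like this (the column order, file names, and id mapping are assumptions; the surrounding cqlsh COPY commands are shown as comments):

    # Illustration of option 1: rewrite the CSV produced by COPY TO.
    # Assumed flow (run in cqlsh):
    #   COPY ks.table2 (phonenumber, userid) TO 'table2.csv';
    #   ... run this script ...
    #   TRUNCATE ks.table2;   -- or create a new table
    #   COPY ks.table2 (phonenumber, userid) FROM 'table2_new.csv';
    # The column order and the old -> new id mapping are assumptions.
    import csv

    ID_MAP = {"001": "002", "003": "004"}

    with open("table2.csv", newline="") as src, \
         open("table2_new.csv", "w", newline="") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        for phonenumber, userid in reader:
            writer.writerow([phonenumber, ID_MAP.get(userid, userid)])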

answered Nov 03 '22 by Jeff Jirsa


This smells like an anti-pattern.

Primary keys should be stable

Primary keys (especially the partition key) shouldn't be changing, especially globally across the dataset.

When the partition key changes, a row gets a new token, so it has to be moved from its current replica nodes to new replica nodes.

When any part of the primary key changes, the rows need to be resorted.

Changing the primary key is an expensive operation. And as you are discovering, updating all of the references in other tables is also expensive.

If the field you've chosen as your primary key is not stable, then you should consider using a different, more stable, field as your primary key. Worst case, use a synthetic key (uuid or timeuuid).

I strongly suggest you revisit your data model and adjust it to support your "data migration" needs in a way that doesn't require modifying the primary key.

If you provide more details about your migration requirement then we might be able to suggest a better way to model it.

answered Nov 03 '22 by Brandon