I have Apache Cassandra running on 4 VMs in Google Cloud. I consider it too expensive and want to export all the data to BigQuery. There are about 2 TB (60 billion rows) in Cassandra. Any suggestions on how I can do it?
Thanks in advance.
We decided to move 5 years of data from Apache Cassandra to Google BigQuery. The problem was not just transferring the data or the export/import itself; the real issue was the very old version of Cassandra!
After extensive research, we planned the migration as follows: export the data to CSV, upload it to Google Cloud Storage, and import it into BigQuery from there.
The pain was the way Cassandra 1.1 deals with a large number of records. There is no paging, so at some point you are going to run out of memory. If I am not mistaken, paging was introduced in the 2.x line.
After all my attempts to upgrade to the latest version (3.4) failed, I decided to try other versions, and luckily version 2.2 worked. By "worked" I mean I was able to follow the upgrade steps to the end and the data was accessible.
I could not find any support for a direct upgrade, and my attempts to simply upgrade straight to 2.2 also failed, so I had no choice but to upgrade to 2.0 first and then from 2.0 to 2.2. Because this is an extremely delicate task, I would rather point you to the official website first and only then give you a summary. Please make sure you check docs.datastax.com and follow their instructions.
To give an overview, you are going to do these steps:
It is very important to successfully upgrade your SSTables before proceeding to the next step. Simply use:
nodetool upgradesstables
Drain the nodes using
nodetool drain
Then simply stop the node
Edit /etc/yum.repos.d/datastax.repo
[datastax]
name = DataStax Repo for Apache Cassandra
baseurl = https://rpm.datastax.com/community
enabled = 1
gpgcheck = 0
And then install and start the service:
yum install dsc20
service cassandra start
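Before moving on, it is worth confirming that the node has come back up and is reporting the expected version. The standard checks (nothing here is specific to this migration) are:
nodetool status
nodetool version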
Once you have upgraded to Cassandra 2.x, you can export the data to CSV without running into paging or crashing issues.
Just for the record, a few commands to get the necessary information about the data structure are as follows:
cqlsh -u username -p password
describe tables;
describe table abcd;
describe schema;
Once we know which tables we want to export, we just reference them along with their keyspace. First, add all your commands to one file to create a batch:
vi commands.list
For example, here is a sample command to export one table:
COPY keyspace.tablename TO '/backup/export.csv';
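If you want the CSV to carry a header row or use a different delimiter, cqlsh COPY TO accepts options. A sketch with common options (the values shown are just defaults you would tune; depending on your exact 2.x point release, throttling options such as PAGESIZE may also be available):
COPY keyspace.tablename TO '/backup/export.csv' WITH HEADER = true AND DELIMITER = ',';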
And finally run the commands from the file:
cqlsh -u username -p password -f /backup/commands.list
So by now, you have exported the tables to CSV file(s). All you need to do now is upload the files to Google Cloud Storage:
gsutil rsync /backup gs://bucket
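For a couple of terabytes of CSV files it is usually worth enabling parallel uploads, for example:
gsutil -m rsync -r /backup gs://bucket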
Later on, you can use the Google Cloud API or command-line tools to import the CSV files into Google BigQuery. You can check the Google documentation for this at cloud.google.com.
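As a rough sketch, the import itself can be done with the bq command-line tool; the dataset, table, and schema file names below are placeholders, and --skip_leading_rows=1 assumes you exported with a header row:
bq load --source_format=CSV --skip_leading_rows=1 mydataset.mytable gs://bucket/export.csv ./schema.json
You can also pass --autodetect instead of a schema file and let BigQuery infer the column types.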
You can also accomplish this task by creating a data pipeline using Apache Beam and running this pipeline via Cloud Dataflow.
Beam includes built-in I/O connectors for Apache Cassandra (CassandraIO, Java) and Google BigQuery (BigQueryIO).
A Beam-based approach also gives you the option to perform any data cleaning and/or transformations you may want to apply as part of the export process.
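To make that concrete, here is a minimal sketch of such a pipeline in Java. Everything in it is a placeholder assumption: a hypothetical table with an id and a value column, made-up Cassandra host IPs and BigQuery table names, an already-existing destination table, and the beam-sdks-java-io-cassandra and beam-sdks-java-io-google-cloud-platform dependencies on the classpath. Treat it as a starting point under those assumptions, not a drop-in solution:

import java.io.Serializable;
import java.util.Arrays;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.SerializableCoder;
import org.apache.beam.sdk.io.cassandra.CassandraIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.SimpleFunction;

import com.datastax.driver.mapping.annotations.Column;
import com.datastax.driver.mapping.annotations.PartitionKey;
import com.datastax.driver.mapping.annotations.Table;
import com.google.api.services.bigquery.model.TableRow;

public class CassandraToBigQuery {

  // Hypothetical entity mapped onto the Cassandra table being exported.
  @Table(keyspace = "mykeyspace", name = "mytable")
  public static class MyRow implements Serializable {
    @PartitionKey
    @Column(name = "id")
    private String id;

    @Column(name = "value")
    private String value;

    public String getId() { return id; }
    public void setId(String id) { this.id = id; }
    public String getValue() { return value; }
    public void setValue(String value) { this.value = value; }
  }

  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply("ReadFromCassandra",
            CassandraIO.<MyRow>read()
                .withHosts(Arrays.asList("10.0.0.1", "10.0.0.2")) // placeholder node IPs
                .withPort(9042)
                .withKeyspace("mykeyspace")
                .withTable("mytable")
                .withEntity(MyRow.class)
                .withCoder(SerializableCoder.of(MyRow.class)))
        // Convert each Cassandra row into a BigQuery TableRow; any cleaning or
        // transformation logic would go here.
        .apply("ToTableRow",
            MapElements.via(new SimpleFunction<MyRow, TableRow>() {
              @Override
              public TableRow apply(MyRow row) {
                return new TableRow().set("id", row.getId()).set("value", row.getValue());
              }
            }))
        // Append into an already-created BigQuery table (placeholder name).
        .apply("WriteToBigQuery",
            BigQueryIO.writeTableRows()
                .to("my-project:my_dataset.my_table")
                .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
                .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

    p.run().waitUntilFinish();
  }
}

Running it with --runner=DataflowRunner (plus your project, region, and temp location options) executes the same pipeline on Cloud Dataflow.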