
How to export data from Cassandra to BigQuery

I have Apache Cassandra running on 4 VMs in Google Cloud. I consider it too expensive and want to export all the data to BigQuery. There are about 2 TB (60 billion rows) in Cassandra. Any suggestions on how I can do this?

Thanks in advance.

Asked Oct 18 '17 by Aleksandr Sosenko

2 Answers

We decided to move 5 years of data from Apache Cassandra to Google BigQuery. The problem was not just transferring the data or doing an export/import; the issue was the very old Cassandra version!

After extensive research, we planned the migration as follows: export the data to CSV, upload it to Google Cloud Storage, and import it into BigQuery from there.

The pain was the way Cassandra 1.1 deals with a large number of records! There is no pagination, so at some point you are going to run out of resources! If I am not mistaken, pagination was introduced in version 2.2.

After all my attempts to upgrade to the latest version (3.4) failed, I decided to try other versions, and luckily version 2.2 worked! By working I mean I was able to follow the upgrade steps to the end and the data was accessible.

I could not get any support for a direct upgrade, and my attempts to simply upgrade straight to 2.2 also failed, so I had no choice but to upgrade to 2.0 first and then to 2.2. Because this is an extremely delicate task, I would rather point you to the official website first and only then give you the summary. Please make sure you check docs.datastax.com and follow their instructions.

To give an overview, you are going to do these steps:

  1. Make sure all nodes are stable and there are no dead nodes.
  2. Make a backup (your SSTables, configurations, etc.).
  3. It is very important to successfully upgrade your SSTables before proceeding to the next step. Simply use

    nodetool upgradesstables

  4. Drain the nodes using

    nodetool drain

  5. Then simply stop the node

  6. Install the new version (a fresh installation is explained below)
  7. Apply the same configuration as your old Cassandra, start it, and run upgradesstables again (as in step 3) on each node.

Installing Cassandra:

Edit /etc/yum.repos.d/datastax.repo

[datastax]
name = DataStax Repo for Apache Cassandra
baseurl = https://rpm.datastax.com/community
enabled = 1
gpgcheck = 0

And then install and start the service:

yum install dsc20
service cassandra start
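
As an optional sanity check (not part of the original steps), you can confirm each node is healthy and running the expected version before moving on:

nodetool status     # every node should report UN (Up/Normal)
nodetool version    # shows the Cassandra release the node is running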

Once you have upgraded to Cassandra 2+, you can export the data to CSV without running into pagination or crashing issues.

Just for the record, a few commands to get the necessary information about the data structure are as follows:

cqlsh -u username -p password
describe tables;
describe table abcd;
describe schema;

Once we know which tables we want to export, we just reference each of them together with its keyspace. First, add all your commands to one file to create a batch.

vi commands.list

For example, here is a command to export one table:

COPY keyspace.tablename TO '/backup/export.csv';

And finally run the commands from the file:

cqlsh -u username -p password -f /backup/commands.list
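
If a single table is still too heavy to export in one pass, newer cqlsh versions accept tuning options on COPY TO; the values below are only illustrative and should be adjusted to your data and hardware:

COPY keyspace.tablename TO '/backup/export.csv'
  WITH PAGESIZE=1000 AND PAGETIMEOUT=60 AND MAXOUTPUTSIZE=10000000;

MAXOUTPUTSIZE caps the number of rows per output file, so a very large table is split into several smaller CSV files.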

By now, you have exported the tables to CSV file(s). All you need to do now is upload the files to Google Cloud Storage:

gsutil rsync /backup gs://bucket

Later on, you can use the Google API to import the CSV files into Google BigQuery. You can check the Google documentation for this at cloud.google.com.
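
As an alternative to the API, here is a minimal sketch of the import step using the bq command-line tool (the dataset name mydataset, table name tablename, and the two-column schema are placeholders you would replace with your own):

bq mk mydataset
bq load --source_format=CSV mydataset.tablename gs://bucket/export.csv id:STRING,value:INT64

Since cqlsh COPY TO writes no header row by default, the default --skip_leading_rows=0 is fine; if you exported with HEADER = TRUE, pass --skip_leading_rows=1.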

Answered by Pentium10


You can also accomplish this task by creating a data pipeline using Apache Beam and running this pipeline via Cloud Dataflow.

Beam includes built-in I/O connectors for Apache Cassandra and Google BigQuery:

CassandraIO (Java)

With a Beam-based approach, you will also have the option to perform any data cleaning and/or transformations you may want to apply as part of your export process.
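
To make that concrete, here is a minimal sketch of such a pipeline in Java; the hosts, keyspace ks, table tablename, the two mapped columns, and the BigQuery table reference are all placeholder assumptions you would replace with your own schema:

import com.datastax.driver.mapping.annotations.Column;
import com.datastax.driver.mapping.annotations.Table;
import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import java.io.Serializable;
import java.util.Arrays;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.SerializableCoder;
import org.apache.beam.sdk.io.cassandra.CassandraIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.TypeDescriptor;

public class CassandraToBigQuery {

  // Entity mapped to the Cassandra table (keyspace, table and column names are placeholders).
  @Table(keyspace = "ks", name = "tablename")
  public static class Row implements Serializable {
    @Column(name = "id")    public String id;
    @Column(name = "value") public Long value;
  }

  public static void main(String[] args) {
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    Pipeline p = Pipeline.create(options);

    // BigQuery schema matching the mapped columns above.
    TableSchema schema = new TableSchema().setFields(Arrays.asList(
        new TableFieldSchema().setName("id").setType("STRING"),
        new TableFieldSchema().setName("value").setType("INT64")));

    p.apply("ReadFromCassandra", CassandraIO.<Row>read()
            .withHosts(Arrays.asList("10.0.0.1", "10.0.0.2"))
            .withPort(9042)
            .withKeyspace("ks")
            .withTable("tablename")
            .withEntity(Row.class)
            .withCoder(SerializableCoder.of(Row.class)))
     .apply("ToTableRow", MapElements.into(TypeDescriptor.of(TableRow.class))
            .via(r -> new TableRow().set("id", r.id).set("value", r.value)))
     .apply("WriteToBigQuery", BigQueryIO.writeTableRows()
            .to("my-project:mydataset.tablename")
            .withSchema(schema)
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE));

    p.run();
  }
}

Run it with --runner=DataflowRunner, your --project/--region, and a --tempLocation GCS bucket, since BigQueryIO stages its load files there.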

Answered by Andrew Mo