 

Export large amount of data from Cassandra to CSV

I'm using Cassandra 2.0.9 to store quite a big amount of data, let's say 100 GB, in one column family. I would like to export this data to CSV in a fast way. I tried:

  • sstable2json - it produces quite big JSON files which are hard to parse, because the tool puts data in one row and uses a complicated schema (e.g. a 300 MB data file becomes ~2 GB of JSON); it takes a lot of time to dump, and Cassandra likes to change source file names according to its internal mechanism
  • COPY - causes timeouts on quite fast EC2 instances for big numbers of records
  • CAPTURE - like above, causes timeouts
  • reads with pagination - I used timeuuid for it, but it returns only about 1.5k records per second

I use an Amazon EC2 instance with fast storage, 15 GB of RAM and 4 cores.

Is there any better option for exporting gigabytes of data from Cassandra to CSV?
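For reference, the paginated-read approach mentioned above can also be driven by token ranges instead of timeuuid. A hedged sketch in cqlsh, assuming a keyspace "ks" with a table "events" and a partition key "id" (all placeholder names):

```shell
# Page through a table by token range (assumed table ks.events, partition key id).
# Start at the minimum token and repeat with the last token seen until no rows return.
cqlsh -e "SELECT id, payload, token(id) FROM ks.events WHERE token(id) > -9223372036854775808 LIMIT 1000;"
```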

asked Jul 22 '14 by KrzysztofZalasa

People also ask

How do I export Cassandra data to CSV?

In cqlsh, redirect query output to a file with CAPTURE: run CAPTURE '/home/Desktop/user.csv'; followed by SELECT * FROM user; — cqlsh reports "Now capturing query output to '/home/Desktop/user.csv'". Alternatively, use DevCenter and execute a query, then right-click on the output and select "Copy All as CSV" to paste the output as CSV.
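The CAPTURE workflow above can be scripted by feeding cqlsh from stdin; a hedged sketch, where the keyspace, table, and file path are placeholders:

```shell
# Capture SELECT output to a file from a scripted cqlsh session.
cqlsh <<'EOF'
CAPTURE '/home/Desktop/user.csv';
SELECT * FROM mykeyspace.user;
CAPTURE OFF;
EOF
```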

How do I export data from Cassandra?

To copy specific rows of a table, use the cqlsh COPY command. First export the data from the table with COPY ... TO, truncate the table if needed, and then re-import with COPY ... FROM. Running COPY data FROM STDIN; changes the prompt to [copy] and lets you paste rows directly.
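The export-then-reimport steps above, sketched as a scripted cqlsh session (keyspace, table, and file path are placeholders):

```shell
# Export a table to CSV, empty it, then re-import from the same file.
cqlsh <<'EOF'
COPY mykeyspace.data TO '/tmp/data.csv' WITH HEADER = TRUE;
TRUNCATE mykeyspace.data;
COPY mykeyspace.data FROM '/tmp/data.csv' WITH HEADER = TRUE;
EOF
```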

How do you take Cassandra dump?

Roughly you need to perform the following steps: take a "snapshot" of the keyspace using nodetool snapshot <keyspace-name>. This is run on the server where you want to generate the snapshot. It stores a "snapshot" for each table of the keyspace.
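A hedged sketch of the snapshot step (the tag and keyspace names are placeholders; exact flags vary slightly between Cassandra versions):

```shell
# Take a named snapshot of one keyspace; SSTables are hard-linked under
# each table's snapshots/ directory on the local node.
nodetool snapshot -t my_backup mykeyspace

# Inspect and later remove snapshots when the backup has been copied off.
nodetool listsnapshots
nodetool clearsnapshot -t my_backup
```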

How do I export and import Cassandra Keyspace?

You can store the output in a file, then import with 'cassandra-cli -f filename'. If using cqlsh, use the 'describe schema' command; you can restrict it to a keyspace with 'describe keyspace <keyspace>'. Save this to a file, then import with 'cqlsh -f filename'.
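The cqlsh variant of that round trip, sketched with placeholder names:

```shell
# Dump the full schema, or just one keyspace, as CQL statements.
cqlsh -e "DESCRIBE SCHEMA;" > schema.cql
cqlsh -e "DESCRIBE KEYSPACE mykeyspace;" > mykeyspace.cql

# Recreate the keyspace on another cluster.
cqlsh -f mykeyspace.cql
```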


3 Answers

Update for 2020: DataStax provides a dedicated tool called DSBulk for loading and unloading data from Cassandra (starting with Cassandra 2.1) and DSE (starting with DSE 4.7/4.8). In the simplest case, the command line looks as follows:

dsbulk unload -k keyspace -t table -url path_to_unload 

DSBulk is heavily optimized for loading/unloading operations and has a lot of options, including import/export from/to compressed files, custom queries, etc.
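A hedged sketch of those options (keyspace, table, paths, and the query are placeholders; the compression option is available in recent DSBulk versions):

```shell
# Unload a whole table to gzip-compressed CSV files.
dsbulk unload -k mykeyspace -t mytable -url /tmp/unload \
  --connector.csv.compression gzip

# Unload the result of a custom query instead of a whole table.
dsbulk unload -query "SELECT id, value FROM mykeyspace.mytable" -url /tmp/unload_q
```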

There is a series of blog posts about DSBulk that provides more information and examples: 1, 2, 3, 4, 5, 6

answered Sep 28 '22 by Alex Ott

Because using COPY is quite challenging when you are trying to export a table with millions of rows from Cassandra, what I did was create a simple tool to fetch the data chunk by chunk (paginated) from the Cassandra table and export it to CSV.

Look at my example solution using the Java driver from DataStax.

answered Sep 28 '22 by Firman Gautama


Inspired by @user1859675's answer, here is how we can export data from Cassandra using Spark:

// Cassandra contact points (placeholders)
val cassandraHostNode = "10.xxx.xxx.x5,10.xxx.xxx.x6,10.xxx.xxx.x7"

// Build a local Spark session pointing at the Cassandra cluster
val spark = org.apache.spark.sql.SparkSession
    .builder
    .config("spark.cassandra.connection.host", cassandraHostNode)
    .appName("Awesome Spark App")
    .master("local[*]")
    .getOrCreate()

// Read the whole table as a DataFrame via the Cassandra data source
val dataSet = spark.read.format("org.apache.spark.sql.cassandra")
    .options(Map("table" -> "xxxxxxx", "keyspace" -> "xxxxxxx"))
    .load()

// Write the DataFrame out as CSV files
val targetfilepath = "/opt/report_values/"
dataSet.write.format("csv").save(targetfilepath)  // Spark 2.x

You will need "spark-cassandra-connector" on your classpath for this to work.
The version I am using is below:

    <dependency>
        <groupId>com.datastax.spark</groupId>
        <artifactId>spark-cassandra-connector_2.11</artifactId>
        <version>2.3.2</version>
    </dependency>
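Alternatively, the connector can be pulled in at submit time with --packages; a hedged sketch, where the jar name is a placeholder and the coordinates match the dependency above (Spark 2.x, Scala 2.11):

```shell
# Fetch the connector from Maven Central and run the export job.
spark-submit \
  --packages com.datastax.spark:spark-cassandra-connector_2.11:2.3.2 \
  export_job.jar
```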
answered Sep 28 '22 by Remis Haroon - رامز