I'm using Cassandra 2.0.9 to store quite a big amount of data, let's say 100 GB, in one column family. I would like to export this data to CSV in a fast way. I tried:
I use an Amazon EC2 instance with fast storage, 15 GB of RAM, and 4 cores.
Is there any better option for exporting gigabytes of data from Cassandra to CSV?
You can use cqlsh's CAPTURE command to write query output to a file:
cqlsh> CAPTURE '/home/Desktop/user.csv';
cqlsh> SELECT * FROM user;
Now capturing query output to '/home/Desktop/user.csv'.
Alternatively, use DevCenter: execute a query, right-click on the output, and select "Copy All as CSV" to paste the output as CSV.
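Note that capture stays on until you turn it off; a minimal sketch of the closing step:
cqlsh> CAPTURE OFF;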
To copy specific rows of a table, you can use COPY ... FROM STDIN. First export the data from the table, then truncate it, and then run:
COPY data FROM STDIN;
After executing the above cqlsh command, the prompt changes to [copy], and you can paste in the rows you want.
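A sketch of that round trip in cqlsh, assuming a table named data and illustrative values (a line containing only \. ends STDIN input):
cqlsh> COPY data TO '/tmp/data.csv';
cqlsh> TRUNCATE data;
cqlsh> COPY data FROM STDIN;
[copy] 1,alice
[copy] \.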
Roughly, you need to perform the following steps: take a snapshot of the keyspace using nodetool snapshot <keyspace-name>. This is run on the server whose data you want to snapshot; it stores a snapshot for each table of the keyspace.
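For example (the keyspace name is illustrative; with the default data directory, the snapshot files land under each table's snapshots folder):
nodetool snapshot my_keyspace
ls /var/lib/cassandra/data/my_keyspace/<table>/snapshots/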
You can store the output in a file, then import it with cassandra-cli -f filename. If using cqlsh, you can use the DESCRIBE SCHEMA command. You can restrict it to a keyspace with DESCRIBE KEYSPACE keyspace. You can save this to a file and then import it with cqlsh -f filename.
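As a sketch of that workflow with a reasonably recent cqlsh (keyspace and file names are illustrative):
cqlsh -e "DESCRIBE KEYSPACE my_keyspace" > my_keyspace_schema.cql
cqlsh -f my_keyspace_schema.cql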
Update for 2020: DataStax provides a special tool called DSBulk for loading and unloading data from Cassandra (starting with Cassandra 2.1) and DSE (starting with DSE 4.7/4.8). In the simplest case, the command line looks as follows:
dsbulk unload -k keyspace -t table -url path_to_unload
DSBulk is heavily optimized for loading/unloading operations and has a lot of options, including import/export from/to compressed files, custom queries, etc.
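For instance, a sketch of an unload with a custom query and gzip-compressed output (keyspace, table, columns, and paths are illustrative; the compression option is available in recent DSBulk releases):
dsbulk unload -query "SELECT id, name FROM ks.user" -url /tmp/unload --connector.csv.compression gzip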
There is a series of blog posts about DSBulk that provide more information and examples: 1, 2, 3, 4, 5, 6
Using COPY is quite challenging when you are trying to export a table with millions of rows from Cassandra, so what I did was create a simple tool that gets the data chunk by chunk (paginated) from a Cassandra table and exports it to CSV.
Have a look at my example solution using the Java driver from DataStax.
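As a minimal sketch of that paginated approach, assuming the DataStax Java driver 3.x and an illustrative user table with id and name columns (the contact point, fetch size, and file path are assumptions as well):

import com.datastax.driver.core.*;
import java.io.FileWriter;

public class PaginatedCsvExport {
    public static void main(String[] args) throws Exception {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("my_keyspace");
             FileWriter out = new FileWriter("user.csv")) {
            // Fetch 1000 rows per page; the driver pages transparently as we iterate.
            Statement stmt = new SimpleStatement("SELECT id, name FROM user").setFetchSize(1000);
            for (Row row : session.execute(stmt)) {
                out.write(row.getUUID("id") + "," + row.getString("name") + "\n");
            }
        }
    }
}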
Inspired by @user1859675's answer, here is how we can export data from Cassandra using Spark:
// Cassandra contact points (comma-separated list of node IPs)
val cassandraHostNode = "10.xxx.xxx.x5,10.xxx.xxx.x6,10.xxx.xxx.x7"

// Build a local SparkSession pointed at the Cassandra cluster
val spark = org.apache.spark.sql.SparkSession
    .builder
    .config("spark.cassandra.connection.host", cassandraHostNode)
    .appName("Awesome Spark App")
    .master("local[*]")
    .getOrCreate()

// Read the whole table as a DataFrame via the connector's data source
val dataSet = spark.read.format("org.apache.spark.sql.cassandra")
    .options(Map("table" -> "xxxxxxx", "keyspace" -> "xxxxxxx"))
    .load()

// Write it out as CSV (one part file per partition)
val targetfilepath = "/opt/report_values/"
dataSet.write.format("csv").save(targetfilepath) // Spark 2.x
You will need "spark-cassandra-connector
" in your classpath for this to work.
The version I am using is below:
<dependency>
    <groupId>com.datastax.spark</groupId>
    <artifactId>spark-cassandra-connector_2.11</artifactId>
    <version>2.3.2</version>
</dependency>
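If you run the job with spark-submit rather than bundling the connector into your jar, one way (the class and jar names are illustrative) is to pull it from Maven Central with --packages:
spark-submit --packages com.datastax.spark:spark-cassandra-connector_2.11:2.3.2 --class ExportApp export-app.jar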