 

Cassandra: Load large data fast

Tags: copy, cassandra

We're currently working with Cassandra on a single-node cluster to test application development on it. We have a very large data set, approximately 70M lines of text, that we would like to dump into Cassandra.

We have tried all of the following:

  • Line-by-line insertion using the Python Cassandra driver (a minimal sketch of this approach is shown after the table definition below)
  • Cassandra's COPY command
  • Setting sstable compression to none

We have also explored the sstable bulk loader, but we don't have the data in an appropriate .db format for it. Our text file to be loaded has 70M lines that look like:

2f8e4787-eb9c-49e0-9a2d-23fa40c177a4    the magnet programs succeeded in attracting applicants and by the mid-1990s only #about a #third of students who #applied were accepted.

The column family we intend to insert into has this creation syntax:

CREATE TABLE post (
  postid uuid,
  posttext text,
  PRIMARY KEY (postid)
) WITH
  bloom_filter_fp_chance=0.010000 AND
  caching='KEYS_ONLY' AND
  comment='' AND
  dclocal_read_repair_chance=0.000000 AND
  gc_grace_seconds=864000 AND
  index_interval=128 AND
  read_repair_chance=0.100000 AND
  replicate_on_write='true' AND
  populate_io_cache_on_flush='false' AND
  default_time_to_live=0 AND
  speculative_retry='99.0PERCENTILE' AND
  memtable_flush_period_in_ms=0 AND
  compaction={'class': 'SizeTieredCompactionStrategy'} AND
  compression={};
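
For reference, the line-by-line approach we tried looks roughly like the following minimal sketch (the keyspace name, input file name, and whitespace-delimited format are illustrative assumptions):

import uuid

from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])
session = cluster.connect('demo')

# Prepared statement so the CQL is parsed once, not once per row.
insert = session.prepare("INSERT INTO post (postid, posttext) VALUES (?, ?)")

with open('posts.tsv', encoding='utf-8') as f:
    for line in f:
        # Assumes the uuid and the text are separated by whitespace,
        # as in the sample line shown above.
        postid, posttext = line.rstrip('\n').split(None, 1)
        session.execute(insert, (uuid.UUID(postid), posttext))

cluster.shutdown()

Each session.execute() here is a synchronous round trip to the node.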

Problem: Loading the data into even this simple column family is taking forever: about 5 hours for the 30M lines inserted so far. We were wondering whether there is any way to expedite this, since loading the same 70M lines of data into MySQL takes approximately 6 minutes on our server.

Have we missed something? Or could someone point us in the right direction?

Many thanks in advance!

asked May 08 '14 by QR_Monica

People also ask

Can Cassandra handle big data?

Apache Cassandra is an open-source distributed database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure.

What command bulk loads data files in Cassandra?

Cassandra provides two tools for bulk loading data: the Cassandra bulk loader (sstableloader) and the nodetool import command.

Is Cassandra DB fast?

Writing to an in-memory data structure is much faster than writing to disk. Because of this, Cassandra writes are extremely fast.

Why reads are faster in Cassandra?

The major reason behind Cassandra's extremely fast writes is its storage engine: Cassandra uses log-structured merge trees, whereas a traditional RDBMS uses B+ trees as its underlying data structure.


2 Answers

The sstableloader is the fastest way to import data into Cassandra. You have to write the code to generate the sstables, but if you really care about speed this will give you the most bang for your buck.

This article is a bit old, but the basics of how you generate the SSTables still apply.
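
Generating the SSTables themselves is done with Cassandra's Java writer API, which I won't reproduce here; once they exist, streaming them into the cluster is a single sstableloader invocation. A rough sketch of driving that from a script (the host and staging path are assumptions; sstableloader expects the last two path components to be keyspace/table):

import subprocess

# Stream pre-generated SSTables into the node at 127.0.0.1. The staging
# directory /data/staging/demo/post is a placeholder -- its last two path
# components must match the target keyspace and table.
subprocess.run(
    ['sstableloader', '-d', '127.0.0.1', '/data/staging/demo/post'],
    check=True,
)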

If you really don't want to use the sstableloader, you should be able to go faster by doing the inserts in parallel. A single node can handle multiple connections at once, and you can scale out your Cassandra cluster for increased throughput.
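
For example, the DataStax Python driver ships a concurrent execution helper that keeps many requests in flight instead of waiting for each insert to finish. A hedged sketch (the keyspace name, input file, chunk size, and concurrency level are assumptions):

import itertools
import uuid

from cassandra.cluster import Cluster
from cassandra.concurrent import execute_concurrent_with_args

cluster = Cluster(['127.0.0.1'])
session = cluster.connect('demo')
insert = session.prepare("INSERT INTO post (postid, posttext) VALUES (?, ?)")

def rows(path):
    # Assumes each line is a uuid, whitespace, then the post text.
    with open(path, encoding='utf-8') as f:
        for line in f:
            postid, posttext = line.rstrip('\n').split(None, 1)
            yield (uuid.UUID(postid), posttext)

it = rows('posts.tsv')
while True:
    chunk = list(itertools.islice(it, 10000))
    if not chunk:
        break
    # Keep up to 100 requests in flight per chunk instead of one at a time.
    execute_concurrent_with_args(session, insert, chunk, concurrency=100)

cluster.shutdown()

Even on a single node this usually gives a large speedup over sequential execution, because per-request network latency stops being the bottleneck.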

answered Oct 06 '22 by psanford


I have a two-node Cassandra 2.? cluster. Each node is an i7-4200MQ laptop (1 TB HDD, 16 GB RAM). I have imported almost 5 billion rows using the COPY command. Each CSV file is about 63 GB with approximately 275 million rows, and each file takes about 8-10 hours to import.

Throughput is approximately 6,500 rows per second.

The YAML file is configured to let Cassandra use 10 GB of RAM, in case that helps.

answered Oct 06 '22 by Deepak102ind