Cassandra COPY consistently fails

I was trying to import a CSV with about 20 million rows.

I did a pilot run with a few hundred rows of CSV just to check that the columns were in order and that there were no parsing errors. All went well.

Every time I tried importing the 20-million-row CSV, it failed after a varying amount of time. On my local machine it failed after 90 minutes with the following error; on the server box it fails within 10 minutes:

Processed 4050000 rows; Write: 624.27 rows/ss
code=1100 [Coordinator node timed out waiting for replica nodes' responses] message="Operation timed out - received only 0 responses." info=

{'received_responses': 0, 'required_responses': 1, 'write_type': 0, 'consistency': 1}
Aborting import at record #4050617. Previously-inserted values still present.
4050671 rows imported in 1 hour, 26 minutes, and 43.649 seconds.

Cassandra: Coordinator node timed out waiting for replica nodes' responses (it is a single-node cluster with a replication factor of 1, so why it is waiting for other nodes is another question).

Then, based on a recommendation in another thread, I changed the write timeout, though I was not convinced it was the root cause:

write_request_timeout_in_ms: 20000 

(I also tried changing it to 300000.)

But it still eventually fails.

So now I have chopped the original CSV into many 500,000-line CSVs. This has a better success rate (compared to 0!), but even these fail 2 times out of 5 for various reasons.
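For reference, a minimal Python sketch of the splitting step (the file names and the 500,000-row chunk size are placeholders, and it assumes the CSV has a single header row that each chunk should repeat):

import csv
import itertools

SOURCE = "big_table.csv"   # placeholder for the original 20-million-row CSV
CHUNK_ROWS = 500000        # rows per output file, not counting the header

with open(SOURCE, newline="") as src:
    reader = csv.reader(src)
    header = next(reader)  # copied to the top of every chunk
    for index in itertools.count():
        rows = list(itertools.islice(reader, CHUNK_ROWS))
        if not rows:
            break
        with open("chunk_{:03d}.csv".format(index), "w", newline="") as out:
            writer = csv.writer(out)
            writer.writerow(header)
            writer.writerows(rows)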

Sometimes I get the following error:

Processed 460000 rows; Write: 6060.32 rows/ss
Connection heartbeat failure
Aborting import at record #443491. Previously inserted records are still present, and some records after that may be present as well.

Other times it simply stops updating the progress on the console, and the only way out is to abort with Ctrl+C.

I've spent most of the day like this. My RDBMS is running happily with 5 billion rows. I wanted to test Cassandra with 10 times as much data but I'm having trouble even importing a million rows at a time.

One observation about how the COPY command proceeds: once the command is entered, it starts writing at a rate of about 10,000 rows per second. It sustains this speed until it has inserted about 80,000 rows. Then there is a pause of about 30 seconds, after which it consumes another 70,000 to 90,000 rows, pauses for another 30 seconds, and so on, until it either finishes all rows in the CSV, fails midway with an error, or simply hangs.

I need to get to the root of this. I really hope to find that I am doing something silly and it's not something I have to accept and work around.

I am using Cassandra 2.2.3

asked Oct 22 '15 by Dojo


1 Answer

A lot of people have trouble with the COPY command; it seems to work for small datasets but starts to fail when you have a lot of data.

The documentation recommends using the SSTable loader if you have more than a few million rows to import. I used it at my company and had a lot of consistency problems.

I have tried everything, and for me the safest way to import a large amount of data into Cassandra is to write a small script that reads your CSV and then executes async queries. Python does this very well.
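A minimal sketch of that approach, assuming the DataStax Python driver (cassandra-driver), a single local node, and a hypothetical table my_ks.my_table(id, value); the names and the in-flight window size are illustrative, not taken from the question:

import csv
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("my_ks")

# Prepared statement so each row only ships its values, not the full CQL text
insert = session.prepare("INSERT INTO my_table (id, value) VALUES (?, ?)")

WINDOW = 200   # cap on in-flight requests so a single node is not flooded
futures = []

with open("chunk_000.csv", newline="") as src:
    reader = csv.reader(src)
    next(reader)  # skip the header row
    for row_id, value in reader:
        futures.append(session.execute_async(insert, (int(row_id), value)))
        if len(futures) >= WINDOW:
            for f in futures:   # drain the window before sending more
                f.result()
            futures = []

for f in futures:
    f.result()

cluster.shutdown()

The key point is to bound the number of outstanding execute_async calls so a single node is not overwhelmed while still getting far better throughput than one-at-a-time synchronous inserts.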

answered Sep 30 '22 by Will