
Cassandra Frequent Read Write Timeouts

I changed my whole codebase from Thrift to CQL, using the DataStax Java driver 1.0.1 and Cassandra 1.2.6.

With Thrift I was getting frequent timeouts from the start and could not proceed. After adopting CQL and designing the tables accordingly, I had more success and fewer timeouts.

With that I was able to insert large volumes of data that had not worked with Thrift. But once the data folder reached around 3.5 GB, I started getting frequent write timeout exceptions. Even a use case that worked earlier now throws timeout exceptions. The behaviour is random: something that worked once fails again, even after a fresh setup.

CASSANDRA SERVER LOG

This is a partial Cassandra server log (DEBUG mode) from the time I got the error:

http://pastebin.com/rW0B4MD0

The client exception is:

Caused by: com.datastax.driver.core.exceptions.WriteTimeoutException: Cassandra timeout during write query at consistency ONE (1 replica were required but only 0 acknowledged the write)
    at com.datastax.driver.core.exceptions.WriteTimeoutException.copy(WriteTimeoutException.java:54)
    at com.datastax.driver.core.ResultSetFuture.extractCauseFromExecutionException(ResultSetFuture.java:214)
    at com.datastax.driver.core.ResultSetFuture.getUninterruptibly(ResultSetFuture.java:169)
    at com.datastax.driver.core.Session.execute(Session.java:107)
    at com.datastax.driver.core.Session.execute(Session.java:76)
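A pragmatic client-side mitigation is to catch WriteTimeoutException and retry a bounded number of times. This is a minimal sketch against the driver API visible in the stack trace above, not the poster's actual code; the contact point, query, and retry budget are all illustrative, and a timed-out write may still have been applied server-side, so only idempotent statements should be retried:

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Session;
    import com.datastax.driver.core.exceptions.WriteTimeoutException;

    public class RetryingWrite {
        public static void main(String[] args) {
            Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
            Session session = cluster.connect();
            String cql = "INSERT INTO myks.mytable (id, value) VALUES (1, 'x')"; // placeholder query
            int attempts = 3; // arbitrary retry budget
            for (int i = 0; i < attempts; i++) {
                try {
                    session.execute(cql);
                    break; // success, stop retrying
                } catch (WriteTimeoutException e) {
                    // A timed-out write may still have been applied,
                    // so only retry statements that are idempotent.
                    if (i == attempts - 1) throw e;
                }
            }
            cluster.shutdown(); // driver 1.x API; later versions use close()
        }
    }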

Infrastructure: a 16 GB machine with an i7 processor, with 8 GB of heap given to Cassandra. I am using a SINGLE-node Cassandra with the following yaml values tweaked for timeouts; everything else is default:

  • read_request_timeout_in_ms: 30000
  • range_request_timeout_in_ms: 30000
  • write_request_timeout_in_ms: 30000
  • truncate_request_timeout_in_ms: 60000
  • request_timeout_in_ms: 30000
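Note that these settings only raise the server-side timeouts; the driver keeps its own client-side socket timeout. A minimal sketch of raising the client-side limit to match, assuming SocketOptions.setReadTimeoutMillis is available in your driver version (it is a 2.x API; verify before relying on it with 1.0.1):

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.SocketOptions;

    public class ClientTimeout {
        public static void main(String[] args) {
            // Assumption: setReadTimeoutMillis exists in your driver version;
            // it is documented for 2.x, so check before using it with 1.0.1.
            SocketOptions socketOptions = new SocketOptions()
                    .setReadTimeoutMillis(30000); // match the server's 30s
            Cluster cluster = Cluster.builder()
                    .addContactPoint("127.0.0.1") // illustrative contact point
                    .withSocketOptions(socketOptions)
                    .build();
            cluster.shutdown();
        }
    }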

USE CASE: My use case stores Combinations (my project's terminology) in Cassandra. I am currently testing the storage of 250,000 combinations with 100 parallel threads, each thread storing one combination. In the real case I need to support tens of millions, but that would need different hardware and a multi-node cluster.

Storing ONE combination takes around 2 seconds and involves:

  • 527 INSERT INTO queries
  • 506 UPDATE queries
  • 954 SELECT queries

100 parallel threads store 100 combinations in parallel.
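With roughly a thousand round trips per combination, grouping the single-row statements into CQL batches would cut network overhead considerably. A minimal sketch, assuming a made-up combos.parts table and a placeholder Part type (prepared statements would be preferable to string building in real code):

    import com.datastax.driver.core.Session;
    import java.util.List;

    public class BatchedStore {
        // 'Part' is a made-up placeholder for one row of a combination.
        static class Part { long id; String value; }

        static void storeCombination(Session session, long comboId, List<Part> parts) {
            // BEGIN UNLOGGED BATCH is plain CQL3 and available in Cassandra 1.2.
            StringBuilder batch = new StringBuilder("BEGIN UNLOGGED BATCH\n");
            for (Part p : parts) {
                batch.append(String.format(
                    "INSERT INTO combos.parts (combo_id, part_id, value) VALUES (%d, %d, '%s');%n",
                    comboId, p.id, p.value));
            }
            batch.append("APPLY BATCH;");
            session.execute(batch.toString()); // one round trip instead of parts.size()
        }
    }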

I found the WRITE TIMEOUT behaviour to be random: sometimes it works up to 200,000 combinations before throwing timeouts, and sometimes it fails even before 10,000.

asked Aug 07 '13 by user2572801


2 Answers

I found that during some cassandra-stress read operations, if I set the rate threads too high I get that consistency-level error. Consider lowering the number of threads in your test to something your pool can sustain, in order to stay under the

  • read_request_timeout_in_ms

In my opinion, raising that value in cassandra.yaml is not always a good idea. Consider the hardware resources your machines work with.

For example:

cassandra-stress read n=100000 cl=ONE -rate threads=200 -node N1

will give me the error, while

cassandra-stress read n=100000 cl=ONE -rate threads=121 -node N1

will do the job smoothly.
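The same principle applies in application code: instead of raising server timeouts, cap the number of requests in flight at once. A generic sketch, with an arbitrary limit of 100:

    import java.util.concurrent.Semaphore;

    import com.datastax.driver.core.Session;

    public class BoundedWrites {
        // Arbitrary cap; tune it to what one node sustains without timing out.
        private static final Semaphore IN_FLIGHT = new Semaphore(100);

        static void boundedExecute(Session session, String cql) throws InterruptedException {
            IN_FLIGHT.acquire(); // block rather than pile more load on the node
            try {
                session.execute(cql);
            } finally {
                IN_FLIGHT.release();
            }
        }
    }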

Hope this helps.

P.S. When you do read tests, try to spread the reads evenly over the data with '-pop dist=UNIFORM(1..1000000)' or whatever range you want.
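For instance, combining the lower thread count with an even spread over 1,000,000 rows would look roughly like this (assuming the keyspace was populated with that many rows):

cassandra-stress read n=1000000 cl=ONE -rate threads=121 -pop dist=UNIFORM(1..1000000) -node N1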

answered Oct 27 '22 by Mr'Black


I just spent some time reading my dev Cassandra node's config yaml, because I had a similar problem. My system stalled and threw timeouts when I tried to load about 3 billion SHA2 hashes onto my dev node with only 600 MB RAM ;)

I fixed it by decreasing cache sizes, the waits before flushes, and so on. This made the node slower on writes, but it became stable, and I was then able to load as much data as I needed.

Sorry, but I couldn't figure out which options those were. I remember reading docs about performance tuning and how to calculate the correct values for your system based on CPU cores, RAM, etc.

The problem I had was that the caches were not written to disk fast enough, so everything started to block. After telling it to write more often and let new requests wait, the node became stable and my import got a little slower.
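For reference, the Cassandra 1.2-era cassandra.yaml knobs governing this area include the following; this is a hedged guess at the kind of options meant here, not the exact ones the answer used:

  • memtable_total_space_in_mb (caps memtable memory so flushes start sooner)
  • memtable_flush_writers and memtable_flush_queue_size (control how quickly flushes drain to disk)
  • key_cache_size_in_mb and row_cache_size_in_mb (cache sizes)
  • commitlog_total_space_in_mb (forces flushes when the commit log grows too large)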

It seems that Cassandra's default options are meant for machines with lots of RAM and many cores in a multi-node cluster that can spread the load. To get it running in a local dev environment, dial it down. It's a dev environment, not a live system; take the time to get a coffee or two ;)

Hope that helps you start thinking in the right direction.

answered Oct 27 '22 by Rene M.