I am running a write-heavy program (10 threads, peaking at 25K writes/sec) on a 24-node Cassandra 3.5 cluster on AWS EC2 (each host is a c4.2xlarge: 8 vCPUs and 15 GB RAM).
Every once in a while my Java client, using DataStax driver 3.0.2, gets a write timeout:
com.datastax.driver.core.exceptions.WriteTimeoutException: Cassandra timeout during write query at consistency TWO (2 replica were required but only 1 acknowledged the write)
at com.datastax.driver.core.exceptions.WriteTimeoutException.copy(WriteTimeoutException.java:73)
at com.datastax.driver.core.exceptions.WriteTimeoutException.copy(WriteTimeoutException.java:26)
at com.datastax.driver.core.DriverThrowables.propagateCause(DriverThrowables.java:37)
at com.datastax.driver.core.DefaultResultSetFuture.getUninterruptibly(DefaultResultSetFuture.java:245)
at com.datastax.driver.core.AbstractSession.execute(AbstractSession.java:64)
The error happens infrequently and in a very unpredictable way. So far, I have not been able to link the failures to anything specific (e.g. program running time, data size on disk, time of day, or indicators of system load such as CPU, memory, and network metrics). Nonetheless, it is really disrupting our operations.
I am trying to find the root cause of the issue. Looking online for options, I am a bit overwhelmed by all the possible leads out there.
One thing that really confuses me is that I am getting this error from a fully replicated cluster that reports very few ClientRequest.timeout.write events.
On paper, the situation should be well within Cassandra's failsafe range, so why does my program still fail? Are the numbers not what they appear to be?
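For context, the write path in my client looks roughly like this (a minimal sketch, not the real code; the contact point, keyspace, table, and column names are placeholders):

import com.datastax.driver.core.BoundStatement;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;
import java.util.UUID;

public class WritePathSketch {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder()
                .addContactPoint("10.0.0.1")              // placeholder: one of the 24 nodes
                .build();
        Session session = cluster.connect("my_keyspace"); // placeholder keyspace

        // Prepared insert executed by each of the 10 writer threads
        PreparedStatement insert = session.prepare(
                "INSERT INTO events (id, payload) VALUES (?, ?)");

        BoundStatement bound = insert.bind(UUID.randomUUID(), "payload");
        bound.setConsistencyLevel(ConsistencyLevel.TWO);

        session.execute(bound); // the call that occasionally throws WriteTimeoutException

        cluster.close();
    }
}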
It's not always a bad thing to see timeouts or errors, especially if you're writing at a higher consistency level: the writes may still get through.
I see you mention CL=ONE; you could still get timeouts here, but the write (mutation) may still have got through. I found this blog really useful: https://www.datastax.com/dev/blog/cassandra-error-handling-done-right. Check your server-side (node) logs at the time of the error to see if you have things like ERROR / WARN messages or GC pauses (as one of the comments above mentions); these kinds of events can make a node unresponsive and therefore cause a timeout or another type of error.
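To see whether that is what is happening, it is worth logging the details the coordinator returns with the timeout: the number of replicas that did acknowledge tells you whether the mutation most likely landed. A rough sketch against the 3.x driver (the wrapper is just an illustration, not part of the driver API):

import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.Statement;
import com.datastax.driver.core.exceptions.WriteTimeoutException;

public class WriteDiagnostics {
    // Wraps session.execute() and logs how far the write got before the
    // coordinator gave up, then rethrows.
    public static ResultSet executeLogged(Session session, Statement statement) {
        try {
            return session.execute(statement);
        } catch (WriteTimeoutException e) {
            // The replicas that did acknowledge have applied the mutation; the
            // coordinator just didn't hear from enough of them in time.
            System.err.printf("Write timeout: writeType=%s, acks=%d/%d, cl=%s%n",
                    e.getWriteType(),
                    e.getReceivedAcknowledgements(),
                    e.getRequiredAcknowledgements(),
                    e.getConsistencyLevel());
            throw e;
        }
    }
}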
If your updates are idempotent (ideally they are), then you can build in some retry mechanism.
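Something like this is what I mean (a rough sketch; the helper name and backoff values are arbitrary, and it is only safe if the statement really is idempotent):

import com.datastax.driver.core.Session;
import com.datastax.driver.core.Statement;
import com.datastax.driver.core.exceptions.WriteTimeoutException;

public class IdempotentRetry {
    // Retries an idempotent write a few times with a small backoff when the
    // coordinator reports a write timeout; rethrows once attempts are exhausted.
    public static void executeWithRetry(Session session, Statement statement, int maxAttempts)
            throws InterruptedException {
        statement.setIdempotent(true); // only safe to replay because the write is idempotent
        for (int attempt = 1; ; attempt++) {
            try {
                session.execute(statement);
                return;
            } catch (WriteTimeoutException e) {
                if (attempt >= maxAttempts) {
                    throw e;
                }
                Thread.sleep(100L * attempt); // crude linear backoff
            }
        }
    }
}

As far as I know, the driver's DefaultRetryPolicy only retries write timeouts for batch-log writes, so for plain inserts an application-level retry like this (or a custom RetryPolicy) is the usual approach.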