
Cassandra 3.10 debug.log contains frequent "FailureDetector.java:457 - Ignoring interval time of..."

Tags:

cassandra

The debug.log files for one of our Cassandra 3.10 clusters have frequent messages similar to “FailureDetector.java:457 - Ignoring interval time of…”

The messages appear even if the cluster is idle. I see the messages at a rate of about 1 per second on each node of this 6-node cluster (3 nodes each in two data centers).

Can someone tell me what causes the messages and if they are something to be concerned about?

We have a couple of other small clusters supporting the same application (different environments) and I see this message much less often (days apart).

B. Peek asked Jun 27 '17 22:06


1 Answer

The FailureDetector is responsible for deciding whether a node is considered UP or DOWN.

The gossip process tracks state from other nodes both directly (nodes gossiping directly to it) and indirectly (nodes communicated about secondhand, third-hand, and so on). Rather than have a fixed threshold for marking failing nodes, Cassandra uses an accrual detection mechanism to calculate a per-node threshold that takes into account network performance, workload, and historical conditions. During gossip exchanges, every node maintains a sliding window of inter-arrival times of gossip messages from other nodes in the cluster.
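To make the accrual idea above concrete, here is a minimal sketch in Python. It is not Cassandra's code: the class name `ArrivalWindowSketch`, the window size, and the assumption that inter-arrival times are exponentially distributed are all illustrative choices. It keeps a sliding window of inter-arrival times per peer and turns "time since the last heartbeat" into a suspicion level (phi) that grows the longer the peer stays silent.

```python
import math
from collections import deque


class ArrivalWindowSketch:
    """Hypothetical per-peer sliding window of heartbeat inter-arrival times."""

    def __init__(self, size=1000):
        # Oldest intervals fall out automatically once the window is full.
        self.intervals = deque(maxlen=size)
        self.last_arrival = None

    def heartbeat(self, now):
        """Record a heartbeat seen at time `now` (any consistent time unit)."""
        if self.last_arrival is not None:
            self.intervals.append(now - self.last_arrival)
        self.last_arrival = now

    def phi(self, now):
        """Suspicion level: -log10 of the probability of a gap this long.

        Assumes exponentially distributed intervals, so
        P(gap > t) = exp(-t / mean) and phi = t / (mean * ln 10).
        """
        if not self.intervals:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        return (now - self.last_arrival) / (mean * math.log(10))


window = ArrivalWindowSketch()
for t in (0.0, 1.0, 2.0, 3.0):  # one heartbeat per second
    window.heartbeat(t)
print(window.phi(4.0))   # small: the peer is only slightly late
print(window.phi(30.0))  # large: the peer has been silent for a long time
```

A node would be marked DOWN once phi exceeds a configured threshold, so a peer on a slow but steady link (large mean interval) is given more slack than one that normally answers quickly.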

Here you can find the source code that produces the log message. The messages are logged at DEBUG level because they may help in tracking down the actual issue causing the latency, but they don't indicate a problem on their own.

In other words: your node measures the acknowledgement latency for each gossip message sent to the other nodes, e.g. X nanoseconds for IP address 1, Z nanoseconds for IP address 2, etc. If either X or Z is above the expected 2-second threshold defined in MAX_INTERVAL_IN_NANO, it is reported.
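The threshold check described above can be sketched as follows. This is a simplified illustration in Python, not the actual Java code: the function name `record_interval` and its return value are hypothetical, and the 2-second limit is taken from the description above rather than read from a running cluster.

```python
# Assumed limit: 2 seconds expressed in nanoseconds, per the answer text.
MAX_INTERVAL_NANOS = 2_000_000_000


def record_interval(window, interval_nanos):
    """Add an interval to the window unless it exceeds the limit.

    Returns True if the sample was recorded and False if it was ignored.
    The ignored branch is where the debug message would be logged.
    """
    if interval_nanos <= MAX_INTERVAL_NANOS:
        window.append(interval_nanos)
        return True
    return False


samples = []
record_interval(samples, 1_000_000_000)  # 1 s: kept
record_interval(samples, 2_500_000_000)  # 2.5 s: ignored, would be logged
print(samples)
```

Dropping the oversized samples keeps a single long pause from inflating the average interval, which is why the message by itself only says that one measurement was discarded.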

Problems that can cause this log message:

  • Heavy load on the node(s), e.g. too many large partitions
  • High pressure, e.g. too many queries in a short period of time
  • Bad network connection

The extra FailureDetector logging was added with this change: Expose phi values from failure detector via JMX and tweak debug and trace logging (CASSANDRA-9526)

I also found this open issue, which might be related to your problem: The failure detector becomes more sensitive when the network is flakey (CASSANDRA-9536)

I also found this article about Gossiping and Failure Detection very useful.

Andrea Nagy answered Oct 21 '22 17:10