
Simultaneous repairs cause repair to hang

Tags: cassandra

I'm running Cassandra 3.7 in a 24-node cluster with 3 data centers and 256 vnodes per node. Each node uses a cron job to run "nodetool repair -pr" once a day, at a different hour of the day from every other node.

Sometimes a repair takes more than one hour to complete, so it overlaps with the next node's repair. When this happens, repair starts throwing exceptions and can hang in a bad state. This leads to a cascading failure: each hour another node tries to start a repair, and it hangs too.

Recovering from this is difficult. The only way I have found is to restart not just the nodes with a stuck repair, but all the nodes in the cluster.

The only idea I have for dealing with this is to build some kind of service that checks if any other node is running repair before it starts a repair, maybe by publishing in a Cassandra table when a repair is in progress.
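A minimal sketch of that idea, under stated assumptions: the lock below is an in-memory stand-in for a single-row lock table, and the names `RepairLock`, `try_acquire`, and `repair_if_free` are hypothetical. In a real deployment the conditional write would presumably be a CQL lightweight transaction (`INSERT ... IF NOT EXISTS` with a TTL), so that a crashed node's lock expires on its own; that part is not shown here.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class RepairLock:
    """In-memory stand-in for a one-row 'repair in progress' table (hypothetical)."""
    holder: Optional[str] = None
    expires_at: float = 0.0

    def try_acquire(self, node: str, ttl_seconds: float,
                    now: Optional[float] = None) -> bool:
        """Take the lock only if nobody holds an unexpired lease."""
        now = time.time() if now is None else now
        if self.holder is not None and now < self.expires_at:
            return False  # another repair is still marked as running
        self.holder = node
        self.expires_at = now + ttl_seconds
        return True

    def release(self, node: str) -> None:
        """Clear the lock, but only if this node is the current holder."""
        if self.holder == node:
            self.holder = None


def repair_if_free(lock: RepairLock, node: str, run_repair: Callable[[], None],
                   ttl_seconds: float = 4 * 3600,
                   now: Optional[float] = None) -> bool:
    """Run run_repair() only if the lock is free; return whether it ran."""
    if not lock.try_acquire(node, ttl_seconds, now):
        return False
    try:
        run_repair()
    finally:
        lock.release(node)  # release even if the repair raised
    return True
```

Here `run_repair` would be whatever the cron job does today, for example a function that shells out to "nodetool repair -pr". The lease length (`ttl_seconds`) is an assumed value and would need to exceed the longest expected repair.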

I'm not sure how I will be able to repair all the nodes if the cluster gets bigger since there soon won't be enough hours in the day to run repair on all the nodes one by one.
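The arithmetic behind that concern can be sketched as follows. The 1.5-hour repair duration is an assumed figure for illustration (the only thing known is that repairs sometimes exceed one hour), and both function names are hypothetical.

```python
import math


def sequential_cycle_hours(node_count: int, hours_per_repair: float) -> float:
    """Total time to repair every node one after another."""
    return node_count * hours_per_repair


def max_nodes_in_window(window_hours: float, hours_per_repair: float) -> int:
    """How many one-at-a-time repairs fit into the scheduling window."""
    return math.floor(window_hours / hours_per_repair)


# 24 nodes at an assumed 1.5 hours each need 36 hours, more than one day.
print(sequential_cycle_hours(24, 1.5))   # 36.0
# Only 16 such repairs fit in a 24-hour window without overlapping.
print(max_nodes_in_window(24, 1.5))      # 16
```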

So my main question is, am I running repair incorrectly and what is the recommended way to regularly repair all the nodes of a large cluster?

Is there a way to repair more than one node at a time? The documentation hints that there is, but it isn't clear how to do that. Is it normal that repair would crash and burn when run on more than one node at a time? Is there an easier way to kill the stuck repairs than restarting all the nodes?

Some things I tried:

  1. Running "nodetool repair" without -pr, but this also hangs if run on multiple nodes at once.
  2. Running "nodetool repair -dcpar" - this seems to repair the token ranges owned by the node it is run on in all the data centers, but it also hangs if run on multiple nodes at once.

My keyspace keeps only one replica per data center so I don't think I can use the -local option.

Some of the exceptions I see when repair hangs are:

ERROR [ValidationExecutor:4] 2016-07-07 12:00:31,938 CassandraDaemon.java (line 227) Exception in thread Thread[ValidationExecutor:4,1,main]
java.lang.NullPointerException: null
        at org.apache.cassandra.service.ActiveRepairService$ParentRepairSession.getActiveSSTables(ActiveRepairService.java:495) ~[main/:na]
        at org.apache.cassandra.service.ActiveRepairService$ParentRepairSession.access$300(ActiveRepairService.java:451) ~[main/:na]
        at org.apache.cassandra.service.ActiveRepairService.currentlyRepairing(ActiveRepairService.java:338) ~[main/:na]
        at org.apache.cassandra.db.compaction.CompactionManager.getSSTablesToValidate(CompactionManager.java:1320) ~[main/:na]

ERROR [Repair#6:1] 2016-07-07 12:00:35,221 CassandraDaemon.java (line 227) Exception in thread Thread[Repair#6:1,5,RMI Runtime]
com.google.common.util.concurrent.UncheckedExecutionException: org.apache.cassandra.exceptions.RepairException: [repair #67bd9b10-...]]] Validation failed in /198.18.87.51
        at com.google.common.util.concurrent.Futures.wrapAndThrowUnchecked(Futures.java:1525) ~[guava-18.0.jar:na]
        at com.google.common.util.concurrent.Futures.getUnchecked(Futures.java:1511) ~[guava-18.0.jar:na]
        at org.apache.cassandra.repair.RepairJob.run(RepairJob.java:160) ~[main/:na]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[na:1.8.0_71]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[na:1.8.0_71]
        at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_71]
Caused by: org.apache.cassandra.exceptions.RepairException: [repair #67bd9b10...]]] Validation failed in /198.18.87.51
        at org.apache.cassandra.repair.ValidationTask.treesReceived(ValidationTask.java:68) ~[main/:na]
        at org.apache.cassandra.repair.RepairSession.validationComplete(RepairSession.java:183) ~[main/:na]
        at org.apache.cassandra.service.ActiveRepairService.handleMessage(ActiveRepairService.java:439) ~[main/:na]
        at org.apache.cassandra.repair.RepairMessageVerbHandler.doVerb(RepairMessageVerbHandler.java:169) ~[main/:na]

ERROR [ValidationExecutor:3] 2016-07-07 12:42:01,298 CassandraDaemon.java (line 227) Exception in thread Thread[ValidationExecutor:3,1,main]
java.lang.RuntimeException: Cannot start multiple repair sessions over the same sstables
        at org.apache.cassandra.db.compaction.CompactionManager.getSSTablesToValidate(CompactionManager.java:1325) ~[main/:na]
        at org.apache.cassandra.db.compaction.CompactionManager.doValidationCompaction(CompactionManager.java:1215) ~[main/:na]
        at org.apache.cassandra.db.compaction.CompactionManager.access$700(CompactionManager.java:81) ~[main/:na]
        at org.apache.cassandra.db.compaction.CompactionManager$11.call(CompactionManager.java:844) ~[main/:na]
Jim Meyer asked Jul 07 '16 15:07



1 Answer

You can try cassandra-reaper, a tool for running automated repairs of Cassandra: https://github.com/thelastpickle/cassandra-reaper

Ji Zhou answered Nov 07 '22 10:11
