Update - Short version:
The PropertyFileSnitch file cassandra-topology.properties on the first 3 nodes (Racks 1-3) lists only those nodes in DC1 and places every other node in DC2 via the default value default=DC2:r1. When the cluster was scaled up by adding nodes 4 and 5, the PropertyFileSnitch on the new nodes was configured to put them in DC1 as well, in Racks 4 and 5, but the file on the first 3 nodes was never updated. As a result the cluster is now in this inconsistent state.
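For illustration, this is roughly what the mismatched files would look like; the post does not show their exact contents, so the entries below are inferred from the IPs and racks in the nodetool status output further down, and whether nodes 4 and 5 kept the default=DC2:r1 line is an assumption:

    # cassandra-topology.properties on nodes 1-3 (stale, never updated after the scale-up)
    10.0.0.10=DC1:Rack1
    10.0.0.11=DC1:Rack2
    10.0.0.12=DC1:Rack3
    # anything not listed above falls back to DC2, which is where nodes 4 and 5 end up
    default=DC2:r1

    # cassandra-topology.properties on nodes 4-5 (updated during the scale-up)
    10.0.0.10=DC1:Rack1
    10.0.0.11=DC1:Rack2
    10.0.0.12=DC1:Rack3
    10.0.0.13=DC1:Rack4
    10.0.0.14=DC1:Rack5
    default=DC2:r1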
My question is whether this cluster can be rebalanced (fixed). Would it suffice to do a full cluster restart after fixing cassandra-topology.properties?
Please advise on how I can safely rebalance the cluster.
Longer version:
I am new to Cassandra and I started working on an already built cluster.
I have 5 nodes in the same data center, on different racks, running Cassandra version 3.0.5 with vnodes (num_tokens: 256) and a keyspace with replication = {'class': 'NetworkTopologyStrategy', 'DC1': '3'} AND durable_writes = true.
Historically there were only 3 nodes, and the cluster was later scaled up with 2 additional nodes. I have an automatic repair script that runs nodetool repair with the options parallelism: parallel, primary range: false, incremental: true, job threads: 1.
The problems started to appear after a large amount of data was inserted. When the repair script runs on node 4 or 5, node 2 gets overloaded: CPU usage stays at 100%, the MutationStage queue grows, and GC pauses take at least 1 s, until the Cassandra process finally dies. The repair usually fails with the error Stream failed (progress: 0%).
When running the nodetool status command on nodes 1, 2 or 3 I get the following output:

    Datacenter: DC2
    Status=Up/Down
    |/ State=Normal/Leaving/Joining/Moving
    --  Address    Load      Tokens  Owns (effective)  Host ID   Rack
    UN  10.0.0.13  10.68 GB  256     0.0%              75e17b8a  r1
    UN  10.0.0.14  9.43 GB   256     0.0%              21678ddb  r1

    Datacenter: DC1
    Status=Up/Down
    |/ State=Normal/Leaving/Joining/Moving
    --  Address    Load      Tokens  Owns (effective)  Host ID   Rack
    UN  10.0.0.10  16.14 GB  256     100.0%            cf9d327f  Rack1
    UN  10.0.0.11  22.83 GB  256     100.0%            e725441e  Rack2
    UN  10.0.0.12  19.66 GB  256     100.0%            95b5c8e3  Rack3
But when running the nodetool status command on nodes 4 or 5 I get the following output:

    Datacenter: DC1
    Status=Up/Down
    |/ State=Normal/Leaving/Joining/Moving
    --  Address    Load      Tokens  Owns (effective)  Host ID   Rack
    UN  10.0.0.13  10.68 GB  256     58.9%             75e17b8a  Rack4
    UN  10.0.0.14  9.43 GB   256     61.1%             21678ddb  Rack5
    UN  10.0.0.10  16.14 GB  256     60.3%             cf9d327f  Rack1
    UN  10.0.0.11  22.83 GB  256     61.4%             e725441e  Rack2
    UN  10.0.0.12  19.66 GB  256     58.3%             95b5c8e3  Rack3
After further investigation, it seems that the PropertyFileSnitch file cassandra-topology.properties was not updated on nodes 1, 2 and 3 (which are also the seeds for this cluster) after the cluster was scaled up.
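The topology mismatch can also be seen from the gossip state each node advertises: every endpoint reports the data center and rack it believes it is in. A quick check (assuming nodetool gossipinfo, which is available in 3.0) is:

    # list each endpoint together with the DC and RACK it is gossiping
    nodetool gossipinfo | grep -E '^/|DC:|RACK:'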
Thanks!
After searching several online resources I found some possible solutions. I'll post them here so they are accessible to everyone.
From Practical Cassandra: A Developer's Approach:
Ring View Differs between Nodes
When the ring view differs between nodes, it is never a good thing. There is also no easy way to recover from this state. The only way to recover is to do a full cluster restart. A rolling restart won’t work because the Gossip protocol from the bad nodes will inform the newly booting good nodes of the bad state. A full cluster restart and bringing the good nodes up first should enable the cluster to come back up in a good state.
The same solution can also be found in the DataStax docs: View of ring differs between some nodes
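Based on that advice, a full-cluster-restart sequence would look roughly like the following. This is a sketch, not an exact procedure: the service commands depend on how Cassandra is installed, and the order (seeds first) follows the "bring the good nodes up first" guidance above.

    # on every node: make cassandra-topology.properties identical everywhere,
    # then flush and stop Cassandra cleanly
    nodetool drain
    sudo service cassandra stop

    # with ALL nodes down, start the seed nodes (1-3) first, one at a time,
    # then start nodes 4 and 5
    sudo service cassandra start
    nodetool status   # verify every node now reports a single DC1 with Racks 1-5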
I also found a similar question on the Apache Cassandra Community. The answer from the community users thread is:
What has happened is that you have now two datacenters in your cluster. The way they replicate information will depend on your keyspace settings. Regarding your process I don't think it is safe to do it that way. I'd start off by decommissioning nodes 4 and 5 so that your cluster is back to 1 datacenter with 3 nodes and then add them sequentially again making sure the configuration in the Snitch is the proper one.
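If you go that route instead, the sequence would be roughly as follows (again a hedged sketch, one node at a time, with install-dependent service commands):

    # on node 4, then on node 5: stream its data back to the remaining replicas and leave the ring
    nodetool decommission

    # once both are out, fix cassandra-topology.properties on ALL nodes (1-3 included);
    # clear the decommissioned node's data, commitlog and saved_caches directories so it
    # bootstraps fresh, then start node 4, wait for it to finish joining, and repeat for node 5
    sudo service cassandra start
    nodetool status   # the node should show UN in DC1 before you add the next one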