I'm trying to add a new node to an existing Cassandra 3.11.1.0 cluster with auto_bootstrap: true option. The new node Completed streaming the data from other nodes, the secondary index build and compact procedures for main table but after that it seems to be stuck in JOINING state. There are no errors/warnings in node's system.log - just INFO messages.
Also during secondary index build and compact procedures there was significant CPU load on node and now there is none. So it looks like the node is stuck during bootstrap and currently idle.
# nodetool status
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
UN XX.XX.XX.109 33.37 GiB 256 ? xxxx-9f1c79171069 rack1
UN XX.XX.XX.47 35.41 GiB 256 ? xxxx-42531b89d462 rack1
UJ XX.XX.XX.32 15.18 GiB 256 ? xxxx-f5838fa433e4 rack1
UN XX.XX.XX.98 20.65 GiB 256 ? xxxx-add6ed64bcc2 rack1
UN XX.XX.XX.21 33.02 GiB 256 ? xxxx-660149bc0070 rack1
UN XX.XX.XX.197 25.98 GiB 256 ? xxxx-703bd5a1f2d4 rack1
UN XX.XX.XX.151 21.9 GiB 256 ? xxxx-867cb3b8bfca rack1
nodetool compactionstats shows that there are some compactions pending but I've no idea if there is some activity or it just stuck:
# nodetool compactionstats
pending tasks: 4
- keyspace_name.table_name: 4
nodetool netstats shows that counters of Completed requests for Small/Gossip messages are increasing:
# nodetool netstats
Mode: JOINING
Bootstrap xxxx-81b554ae3baf
/XX.XX.XX.109
/XX.XX.XX.47
/XX.XX.XX.98
/XX.XX.XX.151
/XX.XX.XX.21
Read Repair Statistics:
Attempted: 0
Mismatch (Blocking): 0
Mismatch (Background): 0
Pool Name Active Pending Completed Dropped
Large messages n/a 0 0 0
Small messages n/a 0 571777 0
Gossip messages n/a 0 199190 0
nodetool tpstats shows that counters of Completed requests for CompactionExecutor,MigrationStage, GossipStage pools are increasing:
# nodetool tpstats
Pool Name Active Pending Completed Blocked All time blocked
ReadStage 0 0 0 0 0
MiscStage 0 0 0 0 0
CompactionExecutor 0 0 251 0 0
MutationStage 0 0 571599 0 0
MemtableReclaimMemory 0 0 98 0 0
PendingRangeCalculator 0 0 7 0 0
GossipStage 0 0 185695 0 0
SecondaryIndexManagement 0 0 2 0 0
HintsDispatcher 0 0 0 0 0
RequestResponseStage 0 0 6 0 0
ReadRepairStage 0 0 0 0 0
CounterMutationStage 0 0 0 0 0
MigrationStage 0 0 14 0 0
MemtablePostFlush 0 0 148 0 0
PerDiskMemtableFlushWriter_0 0 0 98 0 0
ValidationExecutor 0 0 0 0 0
Sampler 0 0 0 0 0
MemtableFlushWriter 0 0 98 0 0
InternalResponseStage 0 0 11 0 0
ViewMutationStage 0 0 0 0 0
AntiEntropyStage 0 0 0 0 0
CacheCleanupExecutor 0 0 0 0 0
Message type Dropped
READ 0
RANGE_SLICE 0
_TRACE 0
HINT 0
MUTATION 124
COUNTER_MUTATION 0
BATCH_STORE 0
BATCH_REMOVE 0
REQUEST_RESPONSE 0
PAGED_RANGE 0
READ_REPAIR 0
So it looks like node is still receiving some data from another nodes and applying it but I don't know how to check the progress and should I wait or cancel bootstrap. I've already tried to re-bootstrap this node and got the following situation: node was in UJ state for a long time (16 hours) had some pending compaction and 99.9% of CPU idle. Also I've added nodes to cluster about a month ago and there wasn't any issues - nodes joined during 2-3 hour and became in UN state.
Also nodetool cleanup is running on one of existing nodes on this node I see the following warnings in system.log:
**WARN [STREAM-IN-/XX.XX.XX.32:46814] NoSpamLogger.java:94 log Spinning trying to capture readers [BigTableReader(path='/var/lib/cassandra/data/keyspace_name/table_name-6750375affa011e7bdc709b3eb0d8941/mc-1117-big-Data.db'), BigTableReader(path='/var/lib/cassandra/data/keyspace_name/table_name-6750375affa011e7bdc709b3eb0d8941/mc-1070-big-Data.db'), ...]**
Since cleanup is local procedure it cannot affect new node during bootstrap. But I can be wrong.
Any help will be appreciated.
Sometimes this can happen. Maybe there was an issue with gossip communicating that joining had completed, or maybe another node quickly reported as DN and disrupted the process.
When this happens, you have a couple of options:
auto_bootstrap: false. The node will start, join the cluster, and serve its data. For this option, it's usually a good idea to run a repair once the node is up.Just Auto_bootstrap: false on cassandra.yaml of new node. and then restart the node. it will join as UN. After some time run full repair which will ensure the consistency.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With