
Cassandra node stuck in Joining state

I'm trying to add a new node to an existing Cassandra 3.11.1.0 cluster with the auto_bootstrap: true option. The new node completed streaming data from the other nodes, the secondary index build, and compaction for the main table, but after that it seems to be stuck in the JOINING state. There are no errors or warnings in the node's system.log, just INFO messages.

Also, during the secondary index build and compaction there was significant CPU load on the node, and now there is none. So it looks like the node is stuck in bootstrap and is currently idle.

# nodetool status
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address         Load       Tokens       Owns    Host ID                               Rack
UN  XX.XX.XX.109  33.37 GiB  256          ?       xxxx-9f1c79171069  rack1
UN  XX.XX.XX.47   35.41 GiB  256          ?       xxxx-42531b89d462  rack1
UJ  XX.XX.XX.32   15.18 GiB  256          ?       xxxx-f5838fa433e4  rack1
UN  XX.XX.XX.98   20.65 GiB  256          ?       xxxx-add6ed64bcc2  rack1
UN  XX.XX.XX.21   33.02 GiB  256          ?       xxxx-660149bc0070  rack1
UN  XX.XX.XX.197  25.98 GiB  256          ?       xxxx-703bd5a1f2d4  rack1
UN  XX.XX.XX.151  21.9 GiB   256          ?       xxxx-867cb3b8bfca  rack1

nodetool compactionstats shows some pending compactions, but I can't tell whether they are making progress or are stuck:

# nodetool compactionstats
pending tasks: 4
- keyspace_name.table_name: 4

nodetool netstats shows that the Completed counters for Small and Gossip messages are increasing:

# nodetool netstats
Mode: JOINING
Bootstrap xxxx-81b554ae3baf
    /XX.XX.XX.109
    /XX.XX.XX.47
    /XX.XX.XX.98
    /XX.XX.XX.151
    /XX.XX.XX.21
Read Repair Statistics:
Attempted: 0
Mismatch (Blocking): 0
Mismatch (Background): 0
Pool Name                    Active   Pending      Completed   Dropped
Large messages                  n/a         0              0         0
Small messages                  n/a         0         571777         0
Gossip messages                 n/a         0         199190         0

nodetool tpstats shows that the Completed counters for the CompactionExecutor, MigrationStage, and GossipStage pools are increasing:

# nodetool tpstats
Pool Name                         Active   Pending      Completed   Blocked  All time blocked
ReadStage                              0         0              0         0                 0
MiscStage                              0         0              0         0                 0
CompactionExecutor                     0         0            251         0                 0
MutationStage                          0         0         571599         0                 0
MemtableReclaimMemory                  0         0             98         0                 0
PendingRangeCalculator                 0         0              7         0                 0
GossipStage                            0         0         185695         0                 0
SecondaryIndexManagement               0         0              2         0                 0
HintsDispatcher                        0         0              0         0                 0
RequestResponseStage                   0         0              6         0                 0
ReadRepairStage                        0         0              0         0                 0
CounterMutationStage                   0         0              0         0                 0
MigrationStage                         0         0             14         0                 0
MemtablePostFlush                      0         0            148         0                 0
PerDiskMemtableFlushWriter_0           0         0             98         0                 0
ValidationExecutor                     0         0              0         0                 0
Sampler                                0         0              0         0                 0
MemtableFlushWriter                    0         0             98         0                 0
InternalResponseStage                  0         0             11         0                 0
ViewMutationStage                      0         0              0         0                 0
AntiEntropyStage                       0         0              0         0                 0
CacheCleanupExecutor                   0         0              0         0                 0

Message type           Dropped
READ                         0
RANGE_SLICE                  0
_TRACE                       0
HINT                         0
MUTATION                   124
COUNTER_MUTATION             0
BATCH_STORE                  0
BATCH_REMOVE                 0
REQUEST_RESPONSE             0
PAGED_RANGE                  0
READ_REPAIR                  0

So it looks like the node is still receiving some data from other nodes and applying it, but I don't know how to check the progress, or whether I should wait or cancel the bootstrap. I've already tried to re-bootstrap this node once and ended up in the same situation: the node stayed in the UJ state for a long time (16 hours), with some pending compactions and 99.9% CPU idle. I also added nodes to this cluster about a month ago without any issues: those nodes joined within 2-3 hours and reached the UN state.
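One rough way to tell whether the node is doing any work is to take two nodetool tpstats snapshots some minutes apart and compare the Completed column. The sketch below does that comparison on the raw text output. The function names parse_tpstats and changed_pools are made up for this illustration, and the parser assumes the six-column pool layout shown above. It only shows which pools are active, not how far the bootstrap has progressed.

```python
def parse_tpstats(text: str) -> dict:
    """Map pool name -> Completed counter from `nodetool tpstats` text output."""
    completed = {}
    for line in text.splitlines():
        parts = line.split()
        # Pool rows have six columns: name, active, pending, completed,
        # blocked, all-time blocked. The last five must be integers, which
        # skips the header row and the "Message type / Dropped" section.
        if len(parts) == 6 and all(p.isdigit() for p in parts[1:]):
            completed[parts[0]] = int(parts[3])
    return completed


def changed_pools(before: str, after: str) -> dict:
    """Return pool name -> increase for pools whose Completed counter grew."""
    old, new = parse_tpstats(before), parse_tpstats(after)
    return {
        name: new[name] - old.get(name, 0)
        for name in new
        if new[name] > old.get(name, 0)
    }
```

The two snapshots could be captured with, for example, subprocess.run(["nodetool", "tpstats"], capture_output=True, text=True).stdout, a few minutes apart. If only GossipStage changes between snapshots, the node is probably just gossiping rather than streaming or compacting.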

Also, nodetool cleanup is running on one of the existing nodes, and on that node I see the following warnings in system.log:

**WARN  [STREAM-IN-/XX.XX.XX.32:46814] NoSpamLogger.java:94 log Spinning trying to capture readers [BigTableReader(path='/var/lib/cassandra/data/keyspace_name/table_name-6750375affa011e7bdc709b3eb0d8941/mc-1117-big-Data.db'), BigTableReader(path='/var/lib/cassandra/data/keyspace_name/table_name-6750375affa011e7bdc709b3eb0d8941/mc-1070-big-Data.db'), ...]**

Since cleanup is a local procedure, I assume it cannot affect the new node during bootstrap, but I could be wrong.

Any help will be appreciated.

Evgenia asked Jan 27 '26 15:01


2 Answers

Sometimes this can happen. Maybe there was an issue with gossip communicating that joining had completed, or maybe another node quickly reported as DN and disrupted the process.

When this happens, you have a couple of options:

  1. You can always stop the node, wipe it, and try to join it again.
  2. If you're sure that all (or most) of the data is there, you can stop the node and add the line auto_bootstrap: false to its cassandra.yaml. The node will start, join the cluster, and serve its data. For this option, it's usually a good idea to run a repair once the node is up.
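For the second option, the edit to cassandra.yaml is a one-line change. As a minimal sketch, assuming the key is either absent (the stock file does not list it, and it defaults to true) or present as an uncommented top-level line, a hypothetical helper such as set_auto_bootstrap below could apply it to the file's text. It is a plain text substitution, not a YAML parser, so it is only meant for this one key.

```python
import re


def set_auto_bootstrap(yaml_text: str, enabled: bool) -> str:
    """Return cassandra.yaml text with auto_bootstrap set to the given value."""
    line = "auto_bootstrap: " + ("true" if enabled else "false")
    # Match an existing uncommented top-level auto_bootstrap line, if any.
    pattern = re.compile(r"^auto_bootstrap[ \t]*:.*$", re.MULTILINE)
    if pattern.search(yaml_text):
        return pattern.sub(line, yaml_text, count=1)
    # Key not present: append it, making sure it starts on its own line.
    separator = "" if yaml_text == "" or yaml_text.endswith("\n") else "\n"
    return yaml_text + separator + line + "\n"
```

The node has to be stopped before the file is changed and started again afterwards, followed by the repair mentioned above.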
Aaron answered Jan 30 '26 18:01


Just set auto_bootstrap: false in the cassandra.yaml of the new node and then restart the node. It will join as UN. After some time, run a full repair, which will ensure consistency.

LetsNoSQL answered Jan 30 '26 17:01


