Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

error while trying to stepdown mongodb replica set

Setup:
1 primary, 3 secondaries and 1 arbiter (All 5 are running.)

When I'm doing a stepDown() in primary, getting the below error in the shell -

m101:PRIMARY> rs.stepDown() 
2017-06-07T15:01:21.357 E QUERY    [thread1] Error: error doing query: failed: network error while attempting to run command 'replSetStepDown' on host '127.0.0.1:27018'  :  
DB.prototype.runCommand@src/mongo/shell/db.js:132:1 
DB.prototype.adminCommand@src/mongo/shell/db.js:150:16  
rs.stepDown@src/mongo/shell/utils.js:1261:12
@(shell):1:1 
2017-06-07T15:01:21.360 I NETWORK  [thread1] trying reconnect to 127.0.0.1:27018 (127.0.0.1) failed
2017-06-07T15:01:21.361 I NETWORK  [thread1] reconnect 127.0.0.1:27018 (127.0.0.1) ok
m101:SECONDARY>

Note that the primary changed to secondary in the end. But why I'm getting this error, "failed: network error" here?

Below is the section from the log file of the primary which was stepped down -

2017-06-07T15:01:22.170 I REPL     [replication-5] Restarting oplog query due to error: InterruptedDueToReplStateChange: operation was interrupted. Last fetched optime (with hash): { ts: Timestamp 1496827872000|1, t: 2 }[-3490433912114125886]. Restarts remaining: 3
2017-06-07T15:01:22.170 I REPL     [replication-5] Scheduled new oplog query Fetcher source: localhost:27018 database: local query: { find: "oplog.rs", filter: { ts: { $gte: Timestamp 1496827872000|1 } }, tailable: true, oplogReplay: true, awaitData: true, maxTimeMS: 60000, term: 2 } query metadata: { $replData: 1, $oplogQueryData: 1, $ssm: { $secondaryOk: true } } active: 1 timeout: 65000ms shutting down?: 0 first: 1 firstCommandScheduler: RemoteCommandRetryScheduler request: RemoteCommand 159434 -- target:localhost:27018 db:local cmd:{ find: "oplog.rs", filter: { ts: { $gte: Timestamp 1496827872000|1 } }, tailable: true, oplogReplay: true, awaitData: true, maxTimeMS: 60000, term: 2 } active: 1 callbackHandle.valid: 1 callbackHandle.cancelled: 0 attempt: 1 retryPolicy: RetryPolicyImpl maxAttempts: 1 maxTimeMillis: -1ms
2017-06-07T15:01:22.173 W REPL     [rsBackgroundSync] Fetcher stopped querying remote oplog with error: InvalidSyncSource: Sync source's last applied OpTime { ts: Timestamp 1496827872000|1, t: 2 } is not greater than our last fetched OpTime { ts: Timestamp 1496827872000|1, t: 2 }. Choosing new sync source.
2017-06-07T15:01:22.176 I REPL     [rsBackgroundSync] could not find member to sync from
2017-06-07T15:01:22.180 I REPL     [ReplicationExecutor] Member localhost:27018 is now in state SECONDARY
2017-06-07T15:01:27.335 I REPL     [SyncSourceFeedback] SyncSourceFeedback error sending update to localhost:27018: InvalidSyncSource: Sync source was cleared. Was localhost:27018
2017-06-07T15:01:29.879 I -        [conn2] end connection 127.0.0.1:59400 (5 connections now open)
2017-06-07T15:01:30.899 I NETWORK  [thread1] connection accepted from 127.0.0.1:64421 #28 (5 connections now open)
2017-06-07T15:01:30.899 I NETWORK  [conn28] received client metadata from 127.0.0.1:64421 conn28: { driver: { name: "NetworkInterfaceASIO-Replication", version: "3.4.4" }, os: { type: "Windows", name: "Microsoft Windows 8", architecture: "x86_64", version: "6.2 (build 9200)" } }
2017-06-07T15:01:30.917 I NETWORK  [thread1] connection accepted from 127.0.0.1:64425 #29 (6 connections now open)
2017-06-07T15:01:30.920 I NETWORK  [conn29] received client metadata from 127.0.0.1:64425 conn29: { driver: { name: "NetworkInterfaceASIO-Replication", version: "3.4.4" }, os: { type: "Windows", name: "Microsoft Windows 8", architecture: "x86_64", version: "6.2 (build 9200)" } }
2017-06-07T15:01:30.927 I NETWORK  [thread1] connection accepted from 127.0.0.1:64428 #30 (7 connections now open)
2017-06-07T15:01:30.928 I NETWORK  [conn30] received client metadata from 127.0.0.1:64428 conn30: { driver: { name: "NetworkInterfaceASIO-Replication", version: "3.4.4" }, os: { type: "Windows", name: "Microsoft Windows 8", architecture: "x86_64", version: "6.2 (build 9200)" } }
2017-06-07T15:01:31.561 I REPL     [ReplicationExecutor] Starting an election, since we've seen no PRIMARY in the past 10000ms
2017-06-07T15:01:31.562 I REPL     [ReplicationExecutor] conducting a dry run election to see if we could be elected
2017-06-07T15:01:31.563 I REPL     [ReplicationExecutor] VoteRequester(term 2 dry run) received a no vote from localhost:27018 with reason "candidate's term is lower than mine"; response message: { term: 3, voteGranted: false, reason: "candidate's term is lower than mine", ok: 1.0 }
2017-06-07T15:01:31.563 I REPL     [ReplicationExecutor] not running for primary, we have been superceded already
2017-06-07T15:01:32.183 I REPL     [ReplicationExecutor] Member localhost:27019 is now in state PRIMARY
2017-06-07T15:01:32.185 I REPL     [rsBackgroundSync] sync source candidate: localhost:27019
2017-06-07T15:01:32.186 I ASIO     [NetworkInterfaceASIO-RS-0] Connecting to localhost:27019
2017-06-07T15:01:33.198 I ASIO     [NetworkInterfaceASIO-RS-0] Successfully connected to localhost:27019, took 1012ms (1 connections now open to localhost:27019)

Trying to understand what's happening here. Any leads would be appreciated.

Thanks in advance!

like image 667
Van Peer Avatar asked Feb 05 '23 07:02

Van Peer


1 Answers

This is expected behaviour. When a primary steps down it drops all connections, including your active connection in the mongo shell. The mongo shell then reestablishes the connection (as per the "trying reconnect" message).

The rs.stepDown() documentation could perhaps be clearer on this point with an example of the shell output, but the relevant description is:

Upon successful stepdown, rs.stepDown() forces all clients currently connected to the database to disconnect. This helps ensure that the clients maintain an accurate view of the replica set.

Because the disconnect includes the connection used to run the command, you cannot retrieve the return status of the command if the command completes successfully; i.e. you can only retrieve the return status of the command if it errors. When running the command in a script, the script should account for this behavior.

like image 87
Stennie Avatar answered Mar 09 '23 01:03

Stennie