I am running a conventional MongoDB Replica Set consisting of 3 members (member1 in datacenter A, member2 and member3 in datacenter B).
member1 is the current PRIMARY and I am adding members 2 and 3 via rs.add()
. They are performing their initial sync and become SECONDARY very soon. Everything is fine all day long and the replication delay of both members is 0 seconds until 2 AM at nighttime.
Now: Every night at 2 AM both members shift into the RECOVERING state and stop replication at all, which leads to a replication delay of hours when I am having a look into rs.printSlaveReplicationInfo()
in the morning hours. At around 2 AM there are no massive inserts or maintenance tasks known to me.
I get the following log entries on the PRIMARY:
2015-10-09T01:59:38.914+0200 [initandlisten] connection accepted from 192.168.227.209:59905 #11954 (37 connections now open)
2015-10-09T01:59:55.751+0200 [conn11111] warning: Collection dropped or state deleted during yield of CollectionScan
2015-10-09T01:59:55.869+0200 [conn11111] warning: Collection dropped or state deleted during yield of CollectionScan
2015-10-09T01:59:55.870+0200 [conn11111] getmore local.oplog.rs cursorid:1155433944036 ntoreturn:0 keyUpdates:0 numYields:1 locks(micros) r:32168 nreturned:0 reslen:20 134ms
2015-10-09T01:59:55.872+0200 [conn11111] end connection 192.168.227.209:58972 (36 connections now open)
And, which is more interesting, I get the following log entries on both SECONDARYs:
2015-10-09T01:59:55.873+0200 [rsBackgroundSync] repl: old cursor isDead, will initiate a new one
2015-10-09T01:59:55.873+0200 [rsBackgroundSync] replSet syncing to: member1:27017
2015-10-09T01:59:56.065+0200 [rsBackgroundSync] replSet error RS102 too stale to catch up, at least from member1:27017
2015-10-09T01:59:56.066+0200 [rsBackgroundSync] replSet our last optime : Oct 9 01:59:23 5617035b:17f
2015-10-09T01:59:56.066+0200 [rsBackgroundSync] replSet oldest at member1:27017 : Oct 9 01:59:23 5617035b:1af
2015-10-09T01:59:56.066+0200 [rsBackgroundSync] replSet See http://dochub.mongodb.org/core/resyncingaverystalereplicasetmember
2015-10-09T01:59:56.066+0200 [rsBackgroundSync] replSet error RS102 too stale to catch up
2015-10-09T01:59:56.066+0200 [rsBackgroundSync] replSet RECOVERING
Which is also striking - the start of the oplog "resets" itself every night at around 2 AM:
configured oplog size: 990MB
log length start to end: 19485secs (5.41hrs)
oplog first event time: Fri Oct 09 2015 02:00:33 GMT+0200 (CEST)
oplog last event time: Fri Oct 09 2015 07:25:18 GMT+0200 (CEST)
now: Fri Oct 09 2015 07:25:26 GMT+0200 (CEST)
I am not sure if this is somehow correlated to the issue. I am also wondering that such a small delay (Oct 9 01:59:23 5617035b:17f
<-> Oct 9 01:59:23 5617035b:1af
) lets the members become stale.
Could this also be a server (VM host) time issue or is it something completely different? (Why is the first oplog event being "resetted" every night and not "shifting" to a timestamp like NOW minus 24 hrs?) What can I do to investigate and to avoid?
Upping the oplog size should solve this (per our comments).
Some references for others who run into this issue
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With