
Why does my MongoDB replica keep falling behind?

While converting a collection to a capped collection last night, my secondary's optime started lagging behind the primary's. It advanced slowly, a couple of seconds every several minutes, and eventually fell out of the primary's oplog window. Per the instructions here, I stopped mongod on the secondary, deleted all of its data files, and restarted it, though I forgot to lock the primary against writes. The secondary went through its initialization phase, which took a solid amount of time, and was finally back in business, but when I logged in, replication was even farther behind.
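
For reference, the lag and the oplog window can be checked from the mongo shell with the standard helpers (these exist at least as far back as 2.0):

    // on the primary: prints the oplog size and the time range it covers
    db.printReplicationInfo()
    // on any member: prints each secondary's lag relative to the primary
    db.printSlaveReplicationInfo()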

Since this is the cloud, after all, I created an image of my primary (which is supposed to copy all data), though I couldn't run db.fsyncLock() at the time, because it was taking some writes. The new image finishes, and I launch a new server based on that image, get it added to my replica set, remove the old secondary, and life is good, right? Not quite - the new secondary is lagging about an hour behind, and through the course of the day (and tonight) eventually gets to the point where it's 14 hours behind (though strangely enough still within the oplog window).
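
For reference, the usual pattern for a consistent disk image (which I couldn't use here, since the primary was taking writes) is to lock the node around the copy:

    db.fsyncLock()     // flush pending writes to disk and block new ones
    // ... take the disk image / snapshot while the node is locked ...
    db.fsyncUnlock()   // resume normal writes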

I take the next step from the "resyncing a stale member" page. I shut down mongod on both servers, gzip and copy my data folder from primary to secondary, unzip, fire them both up, and db.fsyncLock() my primary. What blows my mind is that even with THE SAME DATA, after initializing, my secondary says it's 1 hour behind. I add it back into the replica set, and it catches up quickly to being 5 minutes behind.

All good, right? No - flash forward, the secondary is advancing sloooooowly, and is now 20 minutes behind. Mongostat has the secondary pegged at 95+ locked %, and iostat -xm 2 doesn't show anything crazy - the primary is currently idle from not taking writes, and the secondary is definitely not doing much at all (.04 wMB/second). Not sure if it's worth mentioning, but the primary currently feels dog slow/unresponsive when logging into the mongo shell, etc.
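
In case it helps anyone else debugging this, db.currentOp() run against the lagging secondary shows what's in flight while it's pegged (field names per the 2.0-era shell; the 5-second cutoff is arbitrary):

    db.currentOp()    // everything currently running on this node
    // print only operations that have been running for a while
    db.currentOp().inprog.forEach(function(op) {
        if (op.secs_running && op.secs_running > 5) printjson(op);
    });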

What gives, Mongo? Why can't you just catch up? What am I doing wrong in trying to get my secondary caught up?

EDIT Answering questions:

  • Version: 2.0.4
  • Hardware: Both nodes are the same hardware, near as I can tell - 8GB RAM, quad core CPU. I assume it's something virtualized.
  • Write rate: it varies. As mentioned, last night I was converting to a capped collection, which triggered the whole thing. Overnight, there was a process writing a couple hundred small documents (~155 bytes each) a few times an hour, so at most I'd estimate around 100-200 KB/hour. During the day, processing was more intense, updating hundreds of thousands of 500-byte documents and writing a couple hundred thousand more. Still not talking about enormous data quantities. EDIT: found some iostat output from earlier today:
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
xvda              1.00  2564.50  243.50  282.50  8986.00 11388.00    77.47    11.32   21.46    2.36   37.93   0.50  26.50

That one was particularly bursty at 11 wMB/s; I saw util% hit 34% with 7 wMB/s, and 72% at 52 rMB/s. So it's not saturated, but it was definitely a read-heavy workload in the morning. It is interesting, though, that despite the data being ~5GB with ~1GB of indexes (see below), there's so much disk activity. Shouldn't that all be in RAM?

  • Working set: I still haven't found the accepted methodology for calculating working set, but if it helps:
    "collections" : 21,
    "objects" : 15540092,
    "avgObjSize" : 325.26198326238995,
    "dataSize" : 5054601144,
    "storageSize" : 5874327552,
    "numExtents" : 132,
    "indexes" : 43,
    "indexSize" : 864366720,
    "fileSize" : 10666115072,
    "nsSizeMB" : 16,
    "ok" : 1

I can't imagine that that's overwhelming 8GB of RAM, though I could be wrong (a rough way to check is sketched after the mongostat samples below).

  • some recent mongostat samples from secondary:
insert  query update delete getmore command flushes mapped  vsize    res faults locked % idx miss %     qr|qw   ar|aw  netIn netOut  conn    set repl       time 
    *0     *0     *0     *0       0     1|0       0  22.2g  44.9g   912m      0     99.2          0       0|0     0|1     2k   303b   151 mySet  SEC   03:47:54 
    *0     *0     *0     *0       0     1|0       0  22.2g  44.9g  1.85g      0      101          0       0|0     0|1     3k   303b   151 mySet  SEC   03:48:04 
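
As a rough sanity check on the RAM question above (a sketch; serverStatus reports memory in MB, while db.stats() reports bytes):

    var mem = db.serverStatus().mem;
    print("resident MB: " + mem.resident + ", mapped MB: " + mem.mapped);
    var s = db.stats();
    // data + indexes, converted from bytes to MB for comparison
    print("data+index MB: " + Math.floor((s.dataSize + s.indexSize) / (1024 * 1024)));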

EDIT

Tried more things. I shut down the primary (now called A; the secondary will be B), deleted its data, and unzipped its snapshot (now a couple of hours old, but at this point, we're not writing anything new). Started A up with --fastsync, and it's still about 45 seconds behind B's (now primary) optime, which had been hanging out at around 02:19:52 UTC. Finally, about an hour later, A catches up, so I call rs.stepDown() on B. Instantly, rs.status() shows me that both servers have optimes around 04:08 UTC, but B (now secondary) is again lagging by 17 seconds...then 30 seconds...now 7 minutes...

EDIT

A few minutes after taking @matulef's suggestion and re-creating the indexes on my capped collections, as well as restarting the secondary's mongod process, its optime has only advanced a few seconds. The secondary's locked % from mongostat is still hovering between 95% and 104%, and interestingly enough, its res size was swinging pretty wildly from 100MB to 2GB and back again before settling around 1GB.

EDIT (the next evening)

Conclusion to the story - @matulef was on the right track; I should have been more careful about converting a replicated collection to a capped collection. What follows is what happened, though I don't advertise this as being data-safe - I freely acknowledge I may have lost some data in the process, so YMMV.

Creating the indexes for the capped collections on the primary (A) didn't propagate to the secondary (B), and A happened to fail over (not intentionally). Once B was primary, I manually created the indexes on the capped collections there, and the resync operation to bring A in line with B started moving quickly. Unfortunately for me, the oplog windows no longer lined up, so I did end up having to snapshot data from B over to A. Once I restarted mongod with the same set of data, A & B were happy again, and replication has been back in sync ever since.
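
For completeness, the manual fix was just an ensureIndex on _id for each affected capped collection (the collection name here is a placeholder):

    // run on the primary (B at this point), once per capped collection
    db.myCappedCollection.ensureIndex({ _id: 1 })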

asked Jul 11 '12 by awshepard

People also ask

What causes replication lag in MongoDB?

MongoDB replication lag occurs when a secondary node cannot replicate data fast enough to keep up with the rate at which data is being written to the primary node.

What is the maximum number of nodes in a MongoDB replica set?

MongoDB supports replica sets, which can have up to 50 nodes.

How does replica set work in MongoDB?

MongoDB achieves replication through the use of replica sets. A replica set is a group of mongod instances that host the same data set. In a replica set, one node is the primary and receives all write operations. All other instances, the secondaries, apply operations from the primary so that they have the same data set.
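
For illustration, a minimal replica set can be set up from the mongo shell like this (the hostnames and set name are placeholders):

    // run once, connected to one of the mongod instances
    rs.initiate({
        _id: "mySet",
        members: [
            { _id: 0, host: "host0.example.com:27017" },
            { _id: 1, host: "host1.example.com:27017" },
            { _id: 2, host: "host2.example.com:27017" }
        ]
    })
    rs.status()    // shows which member is primary and each member's optime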


1 Answer

The problem here is that capped collections do not have an _id index by default (and the "convertToCapped" command actually drops all indexes for that collection). This is an issue because the secondaries perform updates by applying ops from the oplog, which refer to documents by their _ids. If you are missing an _id index, each update requires a full table scan on the secondaries, causing them to lag far behind.
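
You can see this directly with explain() - on a capped collection with no _id index, an _id lookup reports a BasicCursor, i.e. a full scan (the collection name is a placeholder; cursor names are the pre-2.6 format):

    // "cursor" : "BasicCursor" in the output means no index was used,
    // so every oplog-applied update scans the whole collection
    db.myCappedCollection.find({ _id: ObjectId() }).explain()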

The solution is to create an _id index on the capped collection. However, if you create the index on the primary, but your secondaries are already lagging behind, they won't receive the index creation operation quickly enough. Instead, the best way to fix things is to first fix up each lagging secondary, one-by-one. For each one, shut it down and restart in standalone mode (on a different port, without the --replSet option), build the _id index, then add it back into the set. Finally, once the secondaries are fixed up, you can step down the primary and repeat the process with it too.
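
Concretely, the per-member procedure looks something like this (the port, database, and collection names are placeholders):

    // 1. Shut down the lagging secondary, then restart it standalone,
    //    e.g. `mongod --port 37017 --dbpath /data/db` (note: no --replSet).
    // 2. From a shell connected to port 37017, build the missing index:
    use myDatabase
    db.myCappedCollection.ensureIndex({ _id: 1 })
    // 3. Shut it down again and restart with the original --replSet options;
    //    once it rejoins, it can apply oplog updates by _id without scanning.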

Update: In MongoDB 2.0.x and earlier, capped collections do not have an _id index by default. However, the default behavior is scheduled to change in MongoDB 2.2, so capped collections created in 2.2+ will have an _id index created automatically, just as non-capped collections do. For capped collections created prior to 2.2, you will still need to create an _id index manually using the steps outlined above, but new collections should not suffer from the problems described here.
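
If you're not sure which collections are affected, here's a quick scan from the shell (a sketch using standard shell helpers):

    // list capped collections in the current database without an _id index
    db.getCollectionNames().forEach(function(name) {
        var coll = db.getCollection(name);
        if (!coll.isCapped()) return;
        var hasIdIndex = coll.getIndexes().some(function(ix) {
            return ix.key && ix.key._id === 1;
        });
        if (!hasIdIndex) print(name + ": capped, no _id index");
    });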

answered by matulef