
Why every few minutes activity on local.oplog.rs locks mongo clients

Tags:

mongodb

The problem:

Every one or two minutes, the mongo clients halt for about 3 seconds. Normal operation time for updates is about 1 or 2 milliseconds. When this slowness appears, we have a bunch of updates lasting 1 to 3 seconds.

The slow query log does not show anything related to this. Neither does debugging the mongo client (mongo-php-client).
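
For reference, a minimal sketch of how the profiler threshold could be checked and lowered from the mongo shell, in case the slow updates are somehow falling below the configured slowms; the 100 ms value below is just an example, not the actual setting of this deployment:

    // run against the affected database on the primary
    db.getProfilingStatus()                  // current { was: <level>, slowms: <threshold> }
    db.setProfilingLevel(1, 100)             // profile (and log) operations slower than 100 ms
    db.system.profile.find().sort({ ts: -1 }).limit(5).pretty()   // latest captured slow operations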

The current architecture is a replica set with one master, one slave and one arbiter.

The queries executed are always of the same sort (upsert by _id, insert with a new MongoId). There is no "every few minutes we run this super-expensive update" pattern.

The blocking seems to be caused by local.oplog.rs. At least, that is what the mongotop output below shows. I haven't found any indication that the secondary is causing this issue, as the output of all the following commands seems stable there. I haven't found any information pointing to the slowness being caused by a specific query either.

The idea behind the way we store data is pre-aggregated reports. We have a lot of updates (a couple of hundred per second), but a very low rate of queries. The indexes are date bound (except for _id, which is calculated from the composite key based on all the dimensions the record contains; by this I mean that the _id is not incremental, as it would be with an ObjectId index). Just to give an idea, these are the index sizes in the biggest collection (in MB):

"indexSizes" : {
    "_id_" : 1967,
    "owner_date" : 230,
    "flow_date" : 231,
    "date" : 170
},
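
(For reference, per-index sizes like the ones above can be read from the mongo shell; the collection name below is a placeholder.)

    db.some_collection.stats(1024 * 1024).indexSizes   // the scale argument reports sizes in MB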

Most of the other collections have indexes of 100 MB or less. In all the collections, the _id index is the biggest one. It is worth noting that these ids are generated manually (based on the metadata, so the insertion is done as an upsert) and are not incremental.
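
To make the write pattern concrete, here is a hypothetical sketch of the kind of upsert described above, written for the mongo shell; the dimension fields, the counter and the hashing scheme are invented for illustration and are not taken from the real application:

    // the _id is a deterministic function of the record's dimensions,
    // so repeated events for the same combination hit the same document
    var dims = { owner: "o1", flow: "f1", date: "2015-07-07" };
    var id = hex_md5(dims.owner + "|" + dims.flow + "|" + dims.date);
    db.d_date_flow_owner.update(
        { _id: id },
        { $set: dims, $inc: { events: 1 } },
        { upsert: true }
    );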

FollowUp:

Edit1: After more digging, it seems that the locking is related to the journal being synced (flushed) to disk. The journal is on the same filesystem as the data files, but the slowness is not expected as the disks are fast SSD devices.
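
A possible way to check whether journal syncs or WiredTiger checkpoints line up with the stalls is to sample serverStatus around the time of a stall; this is only a sketch, and the exact counter names may differ slightly between versions:

    // sample once per second on the primary and look for jumps during a stall
    var wt = db.serverStatus().wiredTiger;
    printjson(wt.log);          // journal counters, e.g. log sync operations and sync durations
    printjson(wt.transaction);  // checkpoint counters, e.g. most recent checkpoint duration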

Edit2: After some testing, the write capacity of the disks is not an issue. We are usually writing at a rate of several megabytes per second. Running some tests on the disk, it accepts 150 MB/s without problem.

The expected answer:

Why is this downtime happening?

  • Pointers to possible reasons, in order to investigate further
  • Experience/solution based on similar cases

Issue Explanation:

The following commands are run on the primary node.

Every time the slowness appears, we see the following in mongostat (two examples):

insert query update delete getmore command % dirty % used flushes vsize  res qr|qw ar|aw netIn netOut conn set repl     time
    10    *0    141     *0      93   120|0     0.4   80.0       0 11.5G 9.4G   0|1   1|0  110k    71k  274 rs0  PRI 13:15:44
    12    *0    178     *0      72   105|0     0.2   80.0       1 11.5G 9.4G   0|0   1|0  111k    79k  274 rs0  PRI 13:15:45
    47    *0      7     *0       0   159|0     0.1   80.0       0 11.5G 9.4G   0|0   2|1   15k    44k  274 rs0  PRI 13:15:49 !!!!HERE
    14    *0    929     *0      99   170|0     0.2   80.0       0 11.5G 9.4G   0|0   1|0  419k   417k  274 rs0  PRI 13:15:50
    21    *0    287     *0     124   181|0     0.2   80.0       0 11.5G 9.4G   0|0   1|0  187k   122k  274 rs0  PRI 13:15:51

insert query update delete getmore command % dirty % used flushes vsize  res qr|qw ar|aw netIn netOut conn set repl     time
    10    *0    145     *0      70   108|0     0.3   79.9       0 11.5G 9.4G   0|0   1|0   98k    71k  274 rs0  PRI 13:16:48
    11    *0    155     *0      72   111|0     0.2   79.9       1 11.5G 9.4G   0|0   1|0  103k    75k  274 rs0  PRI 13:16:49
    44    *0      3     *0       0   144|0     0.0   79.9       0 11.5G 9.4G   0|2   1|0   11k    75k  274 rs0  PRI 13:16:53 !!!!HERE
    11    *0    837     *0      94   134|0     0.2   79.9       0 11.5G 9.4G   0|0   1|0  377k   348k  274 rs0  PRI 13:16:54
    12    *0    180     *0      86   139|0     0.2   79.9       0 11.5G 9.4G   0|0   1|0  122k    85k  274 rs0  PRI 13:16:55
    14    *0    195     *0      83   124|0     0.2   79.9       0 11.5G 9.4G   0|0   2|0  125k    89k  274 rs0  PRI 13:16:56

The update column shows a drop, and the following row has many more updates. Note also that we are running mongostat with a 1-second delay. When the slowness appears, mongostat stops replying for a few seconds. The stall is only present on the master, not on the slave server.
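
One thing that can be captured during the stall itself is the server's view of in-flight operations; a minimal sketch, run on the primary while the updates are hanging:

    // operations that have been running for 2 seconds or more
    db.currentOp({ secs_running: { $gte: 2 } }).inprog.forEach(printjson);
    // db.currentOp(true) would also include idle connections and system operations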

This is the output of mongotop when this problem happens (at 2015-07-07T13:29:38):

(an example with a bit more context can be found here)

                                                                                        ns    total     read    write    2015-07-07T13:29:33+02:00
                                                           database_name.d_date_flow_owner    555ms    550ms      4ms                             
                                                                            local.oplog.rs     61ms     53ms      7ms                             
                                                      database_name.client_context_bbbbbbb     15ms      0ms     15ms                             
                                                        database_name.d_date_landing_owner     15ms      0ms     15ms                             
                                                        database_name.d_date_billing_owner     10ms      0ms     10ms                             
database_name.w_bl_carrier_country_date_device_flow_landing_manager_op1_os_owner_prod_site      7ms      0ms      7ms                             
                database_name.d_carrier_country_date_device_flow_landing_op1_os_owner_site      5ms      0ms      5ms                             
                                                        database_name.d_country_date_owner      5ms      0ms      5ms                             
                                                         database_name.d_date_device_owner      5ms      0ms      5ms                             
                                                             database_name.d_date_os_owner      5ms      0ms      5ms                             

                                          ns    total    read    write    2015-07-07T13:29:37+02:00
        database_name.client_context_bbbbbbb      2ms     0ms      2ms                             
      database_name.client_context_aaaaaaaaa      1ms     0ms      1ms                             
                   admin.system.backup_users      0ms     0ms      0ms                             
                     admin.system.namespaces      0ms     0ms      0ms                             
                      admin.system.new_users      0ms     0ms      0ms                             
                        admin.system.profile      0ms     0ms      0ms                             
                          admin.system.roles      0ms     0ms      0ms                             
                          admin.system.users      0ms     0ms      0ms                             
                        admin.system.version      0ms     0ms      0ms                             
                               database_name      0ms     0ms      0ms                             

                                                                        ns     total      read     write    2015-07-07T13:29:38+02:00
                                                            local.oplog.rs    8171ms    4470ms    3701ms                             
                                            database_name.d_date_op1_owner      45ms       0ms      45ms                             
                                         database_name.d_date_device_owner      39ms       0ms      39ms                             
                                                database_name.m_date_owner      34ms       0ms      34ms                             
                                                database_name.d_date_owner      32ms       0ms      32ms                             
                                           database_name.d_date_owner_site      31ms       0ms      31ms                             
                                        database_name.d_carrier_date_owner      30ms       0ms      30ms                             
                                           database_name.d_date_flow_owner      30ms       0ms      30ms                             
                                        database_name.d_date_owner_product      28ms       0ms      28ms                             
database_name.d_carrier_country_date_device_flow_landing_op1_os_owner_site      27ms       0ms      27ms                             

                                                                                        ns    total     read    write    2015-07-07T13:29:39+02:00
                                                           database_name.d_date_flow_owner    558ms    552ms      6ms                             
                                                                            local.oplog.rs     62ms     61ms      1ms                             
                                                        database_name.d_carrier_date_owner     17ms      0ms     17ms                             
                                                                database_name.d_date_owner     16ms      0ms     16ms                             
database_name.w_bl_carrier_country_date_device_flow_landing_manager_op1_os_owner_prod_site      7ms      0ms      7ms                             
                                                        database_name.d_date_billing_owner      6ms      0ms      6ms                             
                database_name.d_carrier_country_date_device_flow_landing_op1_os_owner_site      5ms      0ms      5ms                             
                                                        database_name.d_country_date_owner      5ms      0ms      5ms                             
                                                         database_name.d_date_device_owner      5ms      0ms      5ms                             
                                                            database_name.d_date_op1_owner      5ms      0ms      5ms                             

Debug output of the PHP mongo client that shows the issue (note the 4-second jump between the timestamps of the last two "PhpMongoClient debug" lines):

(an example with a bit more context can be found here)

Update duration: 2ms
Update duration: 1ms
Update duration: 1ms
Update duration: 4006ms
    PhpMongoClient debug: 2015-07-07 10:40:26 -  PARSE (INFO): Parsing mongodb://primary_host.lan,secondary_host.lan
    PhpMongoClient debug: 2015-07-07 10:40:26 -  PARSE (INFO): - Found node: primary_host.lan:27017
    [....]
    PhpMongoClient debug: 2015-07-07 10:40:26 -  REPLSET (FINE): limiting by credentials: done
    PhpMongoClient debug: 2015-07-07 10:40:26 -  REPLSET (FINE): sorting servers by priority and ping time
    PhpMongoClient debug: 2015-07-07 10:40:26 -  REPLSET (FINE): - connection: type: PRIMARY, socket: 42, ping: 0, hash: primary_host.lan:27017;rs0;database_name/user/5ca571e7db198eeee3abee35857bfd53;30751
    PhpMongoClient debug: 2015-07-07 10:40:26 -  REPLSET (FINE): sorting servers: done
    PhpMongoClient debug: 2015-07-07 10:40:26 -  REPLSET (FINE): selecting near servers
    PhpMongoClient debug: 2015-07-07 10:40:26 -  REPLSET (FINE): selecting near servers: nearest is 0ms
    PhpMongoClient debug: 2015-07-07 10:40:26 -  REPLSET (FINE): - connection: type: PRIMARY, socket: 42, ping: 0, hash: primary_host.lan:27017;rs0;database_name/user/5ca571e7db198eeee3abee35857bfd53;30751
    PhpMongoClient debug: 2015-07-07 10:40:26 -  REPLSET (FINE): selecting near server: done
    PhpMongoClient debug: 2015-07-07 10:40:26 -  REPLSET (INFO): pick server: random element 0
    PhpMongoClient debug: 2015-07-07 10:40:26 -  REPLSET (INFO): - connection: type: PRIMARY, socket: 42, ping: 0, hash: primary_host.lan:27017;rs0;database_name/user/5ca571e7db198eeee3abee35857bfd53;30751
    PhpMongoClient debug: 2015-07-07 10:40:26 -  CON (FINE): No timeout changes for primary_host.lan:27017;rs0;database_name/user/5ca571e7db198eeee3abee35857bfd53;30751
    PhpMongoClient debug: 2015-07-07 10:40:30 -  CON (FINE): No timeout changes for primary_host.lan:27017;rs0;database_name/user/5ca571e7db198eeee3abee35857bfd53;30751
Update duration: 3943ms
Update duration: 3476ms
Update duration: 2008ms
Update duration: 961ms
Update duration: 956ms
Update duration: 20ms
Update duration: 20ms
Update duration: 3ms
Update duration: 42ms
Update duration: 24ms
Update duration: 25ms
Update duration: 56ms
Update duration: 24ms
Update duration: 11ms
Update duration: 11ms
Update duration: 3ms
Update duration: 2ms
Update duration: 3ms
Update duration: 1ms
Update duration: 1ms
Update duration: 1ms
Update duration: 2ms

Mongo Information:

  • Mongo Version: 3.0.3
  • Replica set with 1 slave and 1 arbiter
  • Replication lag varies between 0 and 4 seconds (see the shell sketch after this list)
  • Engine: WiredTiger
  • Filesystem: XFS
  • Operating System: Red Hat Enterprise Linux Server release 7.1
  • Memory: 24 GB. Reported by htop as 40% used, 60% cache
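
As referenced in the replication lag item above, the lag and the oplog window can be watched directly with the standard shell helpers (names as of the 3.0 shell):

    rs.printReplicationInfo()          // configured oplog size, used space and the time window it covers
    rs.printSlaveReplicationInfo()     // how far behind the primary each secondary is
    db.getSiblingDB("local").getCollection("oplog.rs").stats(1024 * 1024)   // oplog collection stats, sizes in MB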

asked Jul 07 '15 by epilgrim



1 Answer

This issue has now disappeared. Two actions were undertaken:

  1. Reworked the pre-aggregated reports system. The mongo workload was reduced roughly tenfold.
  2. Upgraded mongo to version 3.0.6

Unfortunately, the two changes were put online with not much time between them. I suspect that reducing the workload did the trick (which might or might not be linked to the issue that @steve-brisk pointed out), at least for now (we'll see when we hit the previous workload levels again). But, as the version was also updated, it might be that we won't hit this issue again even at the previous workload levels.

I have no evidence pointing to either of the two solutions alone, but after taking both of them, the issue is resolved.

answered Sep 24 '22 by epilgrim