Our application uses CouchDB filtered replications to move data between user databases and a master database. As we increase the number of users, replications start failing with this message:
Source and target databases out of sync. Try to increase max_dbs_open at both servers.
We've done that, raising max_dbs_open to a ridiculously high number (10,000), but the failures and messages remain the same. Obviously something else is wrong. Does anyone know what it is?
As it turns out, the message to increase max_dbs_open is at best a partial answer and at worst misleading. In our case the problem wasn't the number of databases that were open but apparently the number of HTTP connections used by our many replications.
Each replication can use min(worker_processes + 1, http_connections) HTTP connections, where worker_processes is the number of workers assigned to each replication and http_connections is the maximum number of HTTP connections allotted to each replication, as described in the CouchDB documentation.
So the total number of connections used is:
number of replications * min(worker_processes + 1, http_connections)
The default value of worker_processes is 4 and the default value of http_connections is 20, so each replication uses min(4 + 1, 20) = 5 connections. If there are 100 replications, the total number of HTTP connections used by replication is 500. Another setting, max_connections, determines the maximum number of HTTP connections a CouchDB server will allow, as described in the CouchDB documentation. The default is 2048.
In our case each user has two replications: one from the user database to the master database and another from the master database to the user database. So, with the default settings, each user we added consumed another 10 HTTP connections (2 replications at 5 connections each), and we eventually blew through the default max_connections.
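The arithmetic above can be sketched in a few lines of Python. This is only a back-of-the-envelope estimate: the function names are made up for illustration, the defaults are the ones quoted above, and it assumes replication is the only thing consuming HTTP connections on the server, which is unlikely to be exactly true in practice.

```python
def connections_per_replication(worker_processes: int = 4,
                                http_connections: int = 20) -> int:
    """Connections one replication can use, per the formula above."""
    return min(worker_processes + 1, http_connections)


def total_connections(replications: int,
                      worker_processes: int = 4,
                      http_connections: int = 20) -> int:
    """Total connections used by a given number of replications."""
    return replications * connections_per_replication(
        worker_processes, http_connections)


def max_users(max_connections: int = 2048,
              replications_per_user: int = 2,
              worker_processes: int = 4,
              http_connections: int = 20) -> int:
    """Rough number of users that fit under max_connections.

    Assumption: nothing else on the server uses HTTP connections.
    """
    per_user = replications_per_user * connections_per_replication(
        worker_processes, http_connections)
    return max_connections // per_user


print(total_connections(100))  # 100 replications with default settings
print(max_users())             # users supported with default settings
```

With the defaults this gives 500 connections for 100 replications and a ceiling of roughly 204 users, which lines up with failures appearing as the user count grows.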
Since our replications are minimal and only a small amount of data moves in each direction between the user and master databases, we dialed back worker_processes and http_connections, increased max_connections, and all is well.
UPDATE
A couple of other findings:
It was necessary to raise the ulimit on the CouchDB process to allow it to have more open connections.
Creating replications too quickly also caused problems. Slowing down the rate at which I created new replications helped ease the problem. YMMV.
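As a rough illustration of slowing down replication creation, here is a minimal Python sketch. Everything in it is hypothetical: create_replication stands in for whatever call your application uses to set up a user's replications, and the half-second delay is an arbitrary starting point, not a value we measured.

```python
import time
from typing import Callable, Iterable, List


def create_throttled(users: Iterable[str],
                     create_replication: Callable[[str], None],
                     delay_seconds: float = 0.5,
                     sleep: Callable[[float], None] = time.sleep) -> List[str]:
    """Call create_replication for each user, pausing between calls.

    create_replication is a placeholder supplied by the caller; this
    function only controls the pacing.
    """
    created = []
    for index, user in enumerate(users):
        if index > 0:
            # Pause before every call except the first one.
            sleep(delay_seconds)
        create_replication(user)
        created.append(user)
    return created


# Usage with a stand-in callback that only prints.
create_throttled(["alice", "bob"],
                 lambda user: print("would create replications for", user),
                 delay_seconds=0.0)
```

The sleep function is passed in as a parameter so the pacing can be tested without actually waiting.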