Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How does CouchDB Replication behave with Failed / Recovered servers?

Consider the following scenario:

3 EC2 instances located in:

  • US-WEST
  • Ireland
  • Tokyo

Each instance is a dedicated CouchDB server. Each CouchDB server is setup to run continuous replication with every other server (bi-directional).

Now assume that the Ireland server goes offline due to some AWS outage. The US-WEST and Tokyo CouchDB servers will retry X number of times and then eventually fail replication with that server (is this correct?)

Lets say 6 hours go by and AWS gets the region back online and that server comes back up -- I assume US-WEST and Tokyo will ignore the server in Ireland until the Irish CouchDB server re-initiates the bi-directional sync with both of them, a la:

Irish CouchDB _replicator Pseudo-Settings

  • replicate[source=localhost,target=us-west]
  • replicate[source=us-west,target=localhost]
  • replicate[source=localhost,target=tokyo]
  • replicate[source=tokyo,target=localhost]

Q1: Is my understanding of Couch's replication failure/recovery correct?

Q2: What if there is a network failure that fixes itself an hour later (specifically: there is no server restart forcing the DB to re-init itself on startup), how do the respective CouchDB instances react to this? I imagine that us-west and tokyo will forget about Ireland, but will Ireland suddenly start talking with those two servers again, re-initializing the bidirectional, continuous replication?

I am specifically interested in failure recovery in the EC2 environment, so if there is a specific detail to that environment I have missed, please let me know.

Thanks!

like image 669
Riyad Kalla Avatar asked Oct 11 '22 06:10

Riyad Kalla


1 Answers

Prior to 1.1, a replication task is not persistent, even a continuous one. In the event of a disconnection, there is a limited attempt at retrying, but eventually it will stop. When connectivity resumes you will need to initiate replication again. Since replication is idempotent (starting the same replication task twice is the same as starting it once), you can just add a cronjob to start it every minute (or whatever interval seems sane to you). If the task is running already, the attempt returns success (but does not start another replication).

In 1.1, you can create persistent replication tasks by creating a document in the special _replicator database. CouchDB will retry this if it crashes or the connection is interrupted. NOTE: 1.1.0 eventually gives up, in the next release (1.1.1) we allow infinite retries.

As CouchDB is designed from the ground up to support multi-master replication, you won't be surprised to hear that it handles interruptions to connectivity very well. The changes that occurred during the interruption are rapidly found and replicated.

like image 109
Robert Newson Avatar answered Oct 12 '22 21:10

Robert Newson