
Synchronize Data across multiple occasionally-connected-clients using EventSourcing (NodeJS, MongoDB, JSON)

I'm facing a problem implementing data-synchronization between a server and multiple clients. I read about Event Sourcing and I would like to use it to accomplish the syncing-part.

I know that this is not a technical question, more of a conceptual one.

I would just send all events live to the server, but the clients are designed to be used offline from time to time.

This is the basic concept: [image: Visual Concept]

The server stores all events that every client should know about. It does not replay those events to serve the data, because its main purpose is to sync the events between the clients so that they can replay all events locally.

Each client has its own JSON store, which also keeps all events and rebuilds all the different collections from the stored/synced events.

As clients can modify data offline, it is not that important to have consistent syncing cycles. With this in mind, the server should handle conflicts when merging the different events and ask the specific user in the case of a conflict.

So, the main problem for me is to determine the diffs between the client and the server to avoid sending all events to the server. I'm also having trouble with the order of the synchronization process: push changes first, or pull changes first?

What I've currently built is a default MongoDB implementation on the serverside, which is isolating all documents of a specific user group in all my queries (Currently only handling authentication and server-side database work). On the client, I've built a wrapper around a NeDB store, enabling me to intercept all query operations to create and manage events per-query, while keeping the default query behaviour intact. I've also compensated for the different ID systems of neDB and MongoDB by implementing custom ids that are generated by the clients and are part of the document data, so that recreating a database won't mess up the IDs (When syncing, these IDs should be consistent across all clients).

The event format will look something like this:

{
   type: 'create/update/remove',
   collection: 'CollectionIdentifier',
   target: ?ID, //The global custom ID of the document updated
   data: {}, //The inserted/updated data
   timestamp: '',
   creator: //Some way to identify the author of the change
}

To save some memory on the clients, I will create snapshots at certain amounts of events, so that fully replaying all events will be more efficient.
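
A minimal sketch of what snapshot-based replay could look like, assuming events follow the format above and the ordered event list is still available. The helper names (`applyEvent`, `takeSnapshot`, `rebuild`) and the `eventCount` field are hypothetical, not part of NeDB or MongoDB:

```javascript
// Apply one event to an in-memory state shaped as { collectionName: { id: document } }.
function applyEvent(collections, event) {
  const docs = collections[event.collection] || (collections[event.collection] = {});
  if (event.type === 'create') {
    docs[event.target] = { ...event.data };
  } else if (event.type === 'update') {
    docs[event.target] = { ...docs[event.target], ...event.data };
  } else if (event.type === 'remove') {
    delete docs[event.target];
  }
  return collections;
}

// A snapshot records the state plus how many events it already includes.
// The JSON round-trip is a simple deep copy for plain JSON data.
function takeSnapshot(collections, eventCount) {
  return { eventCount, collections: JSON.parse(JSON.stringify(collections)) };
}

// Rebuild the current state by replaying only the events after the snapshot.
function rebuild(snapshot, events) {
  const collections = JSON.parse(JSON.stringify(snapshot.collections));
  for (const event of events.slice(snapshot.eventCount)) {
    applyEvent(collections, event);
  }
  return collections;
}

// Usage: snapshot after two events, then replay only the third.
const snapshotEvents = [
  { type: 'create', collection: 'notes', target: 'n1', data: { text: 'draft' } },
  { type: 'create', collection: 'notes', target: 'n2', data: { text: 'other' } },
  { type: 'update', collection: 'notes', target: 'n1', data: { text: 'final' } },
];
const emptySnapshot = { eventCount: 0, collections: {} };
const snapshotAfterTwo = takeSnapshot(rebuild(emptySnapshot, snapshotEvents.slice(0, 2)), 2);
const fromSnapshot = rebuild(snapshotAfterTwo, snapshotEvents);
const fromScratch = rebuild(emptySnapshot, snapshotEvents);
```

Both `fromSnapshot` and `fromScratch` should describe the same collections; the snapshot only shortens the replay.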

So, to narrow down the problem: I'm able to replay events on the client side, and I'm able to create and maintain the events on both the client and the server side. Merging the events on the server side should also not be a problem. Replicating a whole database with existing tools is not an option, as I'm only syncing certain parts of the database (not even entire collections, as the documents are assigned to different groups in which they should sync).

But what I am having trouble with is:

  • The process of determining what events to send from the client when syncing (Avoid sending duplicate events, or even all events)
  • Determining what events to send back to the client (Avoid sending duplicate events, or even all events)
  • The right order of syncing the events (Push/Pull changes)

Another question I would like to ask is whether storing the updates directly on the documents, in a revision-like style, would be more efficient.

If my question is unclear, a duplicate (I found some questions, but they didn't help me in my scenario) or something is missing, please leave a comment. I will maintain it as best as I can to keep it simple, as I've just written down everything that could help you understand the concept.

Thanks in advance!

Joschua Schneider asked Feb 28 '17 10:02


2 Answers

This is a very complex subject, but I'll attempt some form of answer.

My first reflex upon seeing your diagram is to think of how distributed databases replicate data between themselves and recover in the event that one node goes down. This is most often accomplished via gossiping.

Gossip rounds make sure that data stays in sync. Time-stamped revisions are kept on both ends and merged on demand, say when a node reconnects, or simply at a given interval (publishing bulk updates via socket or the like).

Database engines like Cassandra or Scylla use 3 messages per merge round.

Demonstration:

Data in Node A

{ id: 1, timestamp: 10, data: { foo: '84' } }
{ id: 2, timestamp: 12, data: { foo: '23' } }
{ id: 3, timestamp: 12, data: { foo: '22' } }

Data in Node B

{ id: 1, timestamp: 11, data: { foo: '50' } }
{ id: 2, timestamp: 11, data: { foo: '31' } }
{ id: 3, timestamp: 8, data: { foo: '32' } }

Step 1: SYN

Node A lists the ids and last-upsert timestamps of all its documents (feel free to change the structure of these data packets; here I'm using verbose JSON to better illustrate the process).

Node A -> Node B

[ { id: 1, timestamp: 10 }, { id: 2, timestamp: 12 }, { id: 3, timestamp: 12 } ]

Step 2: ACK

Upon receiving this packet, Node B compares the received timestamps with its own. For each document, if Node B's timestamp is older, it places just the id and timestamp in the ACK payload; if it's newer, it places them along with its data. If the timestamps are the same, it does nothing.

Node B -> Node A

[ { id: 1, timestamp: 11, data: { foo: '50' } }, { id: 2, timestamp: 11 }, { id: 3, timestamp: 8 } ]

Step 3: ACK2

Node A updates its documents where ACK data is provided, then sends back the latest data to Node B for those where no ACK data was provided.

Node A -> Node B

[ { id: 2, timestamp: 12, data: { foo: '23' } }, { id: 3, timestamp: 12, data: { foo: '22' } } ]

That way, both nodes now have the latest data merged both ways (in case the client did offline work), without having to send all your documents.
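
The three steps above can be sketched in plain JavaScript. This is only an illustration of the exchange described here, not Cassandra's or Scylla's actual implementation; the function names are hypothetical, each node is modelled as a `Map` from id to document, and the handling of documents that exist on only one side is an added assumption:

```javascript
// Step 1 (SYN): a digest of ids and timestamps only, no data.
function buildSyn(store) {
  return [...store.values()].map(({ id, timestamp }) => ({ id, timestamp }));
}

// Step 2 (ACK): answer with data where the receiver is newer,
// and with a bare { id, timestamp } where it is older.
function buildAck(store, syn) {
  const ack = [];
  const seen = new Set();
  for (const { id, timestamp } of syn) {
    seen.add(id);
    const local = store.get(id);
    if (!local) {
      ack.push({ id, timestamp: null }); // assumption: unknown id means "send it to me"
    } else if (local.timestamp > timestamp) {
      ack.push({ ...local });
    } else if (local.timestamp < timestamp) {
      ack.push({ id, timestamp: local.timestamp });
    }
    // Equal timestamps: nothing to exchange.
  }
  // Assumption: documents the sender did not list are sent along with their data.
  for (const doc of store.values()) {
    if (!seen.has(doc.id)) ack.push({ ...doc });
  }
  return ack;
}

// Step 3 (ACK2): apply received data, return full documents for bare entries.
function applyAckAndBuildAck2(store, ack) {
  const ack2 = [];
  for (const entry of ack) {
    if ('data' in entry) {
      store.set(entry.id, entry);
    } else {
      ack2.push({ ...store.get(entry.id) });
    }
  }
  return ack2;
}

// The receiver finishes the round by applying the ACK2 payload.
function applyAck2(store, ack2) {
  for (const doc of ack2) store.set(doc.id, doc);
}

// Usage with the Node A / Node B data from the demonstration.
const nodeA = new Map([
  [1, { id: 1, timestamp: 10, data: { foo: '84' } }],
  [2, { id: 2, timestamp: 12, data: { foo: '23' } }],
  [3, { id: 3, timestamp: 12, data: { foo: '22' } }],
]);
const nodeB = new Map([
  [1, { id: 1, timestamp: 11, data: { foo: '50' } }],
  [2, { id: 2, timestamp: 11, data: { foo: '31' } }],
  [3, { id: 3, timestamp: 8, data: { foo: '32' } }],
]);

const syn = buildSyn(nodeA);                     // Node A -> Node B
const ack = buildAck(nodeB, syn);                // Node B -> Node A
const ack2 = applyAckAndBuildAck2(nodeA, ack);   // Node A -> Node B
applyAck2(nodeB, ack2);
```

Note that this is a last-write-wins merge per document: it relies on the timestamps of both nodes being comparable, and it silently discards the older side instead of asking the user.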

In your case, your source of truth is your server, but you could easily implement peer-to-peer gossiping between your clients with WebRTC, for example.

Hope this helps in some way.

Cassandra training video

Scylla explanation

NodeNodeNode answered Oct 23 '22 04:10


I think that the best solution to avoid all the event order and duplication issues is to use the pull method. In this way every client maintains its last imported event state (with a tracker, for example) and asks the server for the events generated after that last one.
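
A minimal in-memory sketch of that pull method, assuming the server assigns each stored event an increasing sequence number (`seq`) and each client remembers the last one it imported. `EventLog`, `Client`, `since` and `lastSeq` are hypothetical names for illustration; a real setup would put the log in MongoDB and the tracker in the client's local store:

```javascript
// Stand-in for the server-side event log.
class EventLog {
  constructor() {
    this.events = [];
  }

  // The server assigns seq, so ordering does not depend on client clocks.
  append(event) {
    const stored = { ...event, seq: this.events.length + 1 };
    this.events.push(stored);
    return stored;
  }

  // Everything generated after the given sequence number.
  since(seq) {
    return this.events.filter((event) => event.seq > seq);
  }
}

class Client {
  constructor(log) {
    this.log = log;
    this.lastSeq = 0;   // the tracker: last imported event
    this.applied = [];  // stand-in for replaying into local collections
  }

  // Ask only for events newer than the tracker, then advance it.
  pull() {
    const fresh = this.log.since(this.lastSeq);
    for (const event of fresh) {
      this.applied.push(event);
      this.lastSeq = event.seq;
    }
    return fresh.length;
  }
}

// Usage: a second pull without new events transfers nothing.
const log = new EventLog();
const client = new Client(log);
log.append({ type: 'create', collection: 'notes', target: 'n1', data: { text: 'hi' } });
log.append({ type: 'update', collection: 'notes', target: 'n1', data: { text: 'hello' } });
const firstPull = client.pull();
const secondPull = client.pull();
log.append({ type: 'remove', collection: 'notes', target: 'n1', data: {} });
const thirdPull = client.pull();
```

Because the client only ever asks for `seq > lastSeq`, it never receives an event twice, and the order is the one the server recorded.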

An interesting problem will be to detect the breaking of business invariants. For that you could also store the log of applied commands on the client, and in case of a conflict (events were generated by other clients) you could retry the execution of the commands from that log. You need to do that because some commands will not succeed after re-execution; for example, a client saves a document after another user has deleted that document in the meantime.
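
One way this retry could look, as a rough sketch: the client starts from the state it pulled from the server, re-executes its own logged commands against it, and collects the ones that no longer succeed so the user can be asked about them. The command shape and the `execute` / `replayCommands` helpers are assumptions made for illustration:

```javascript
// Execute one command against a Map of id -> document, enforcing simple invariants.
function execute(state, command) {
  const exists = state.has(command.target);
  switch (command.type) {
    case 'create':
      if (exists) throw new Error(`document ${command.target} already exists`);
      state.set(command.target, { ...command.data });
      break;
    case 'update':
      if (!exists) throw new Error(`document ${command.target} no longer exists`);
      state.set(command.target, { ...state.get(command.target), ...command.data });
      break;
    case 'remove':
      if (!exists) throw new Error(`document ${command.target} no longer exists`);
      state.delete(command.target);
      break;
    default:
      throw new Error(`unknown command type: ${command.type}`);
  }
}

// Re-run the local command log on top of the freshly pulled server state.
// Failed commands are reported instead of aborting the whole replay.
function replayCommands(serverState, commandLog) {
  const state = new Map(serverState); // copy, so the pulled state stays untouched
  const rejected = [];
  for (const command of commandLog) {
    try {
      execute(state, command);
    } catch (err) {
      rejected.push({ command, reason: err.message });
    }
  }
  return { state, rejected };
}

// Usage: another user removed document "a" while this client was offline.
const pulledState = new Map([['b', { title: 'old title', body: 'text' }]]);
const offlineCommands = [
  { type: 'update', target: 'a', data: { title: 'edited offline' } },
  { type: 'update', target: 'b', data: { title: 'new title' } },
];
const replayResult = replayCommands(pulledState, offlineCommands);
```

Here the update to `a` ends up in `replayResult.rejected`, while the update to `b` is applied; the rejected list is what you would show to the user.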

Constantin Galbenu answered Oct 23 '22 04:10