 

How efficient is Meteor's DDP at syncing very large collections?

Meteor's DDP protocol works very well for syncing a small collection of data from a server to a browser-based client, which inherently limits the amount of data that is processed.

However, consider a situation where Meteor is being used to sync a large collection from one server to another, or just the DDP protocol itself is used to sync one MongoDB with another.

How efficient is DDP in this case (computationally)? How well does it scale to several clients? Is the limit to performance only bandwidth or will DDP hit some CPU bound as well? What is the largest amount of data that can be reasonably synced over DDP right now? Is DDP just the wrong approach for doing this (see references below)?

Some additional thoughts:

  • As far as I know, the current version of DDP keeps track of each client's entire collection, so it can't be asymptotically very efficient.
  • Smart Collections were created to improve the performance of server-to-client collection syncing. But it's unclear to me whether this improves DDP itself or something else.

See also:

  • How to implement real-time replication of MongoDB (or CouchDB) to many remote clients
  • DDP vs Straight MongoDB access for synching large amounts of data

EDIT:

After some empirical experience with this, I have to conclude that the answer is "not very efficient". See https://stackoverflow.com/a/21835534/586086 for an explanation.

Discussions with Meteor devs indicated that this problem will be addressed in the future with a revision of DDP and the publish-subscribe API, whereby the merge box will be removed and clients will handle merging. This will save CPU/memory on the server and allow for much larger datasets to be sent over the wire.

asked Nov 11 '22 by Andrew Mao


1 Answer

Basically, it is more a matter of what and how you publish to the client than of the number of clients. An indexed query is typically resolved in O(log2 N), so it is relatively cheap for the server to recompute a result set even if (in the worst case) the whole collection changes. From the server side you can therefore produce updated result sets to publish to clients fairly quickly (when they differ from what the clients already have); see the indexing sketch below.
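
As a minimal sketch (the collection name and fields are illustrative, not from the question), indexing the fields a publication filters and sorts on keeps that recomputation cheap:

    import { Meteor } from 'meteor/meteor';
    import { Mongo } from 'meteor/mongo';

    export const Listings = new Mongo.Collection('listings');

    if (Meteor.isServer) {
      Meteor.startup(async () => {
        // Index the fields the publication filters/sorts on, so changed
        // result sets can be recomputed in roughly O(log N).
        await Listings.rawCollection().createIndex({ city: 1, price: 1 });
      });
    }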

The real problem (and a common error) comes when you publish everything to the client (as the former autopublish package did). Build your publications carefully so that you only send what the client is actually supposed to see: either prune documents by hiding unneeded fields, or shrink the result set by parameterizing the publication to the data scope your application actually uses, as in the sketch below.
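
A minimal sketch of such a publication, assuming a hypothetical Listings collection (the publication name and fields are made up for illustration): it prunes fields and scopes the result set with parameters.

    import { Meteor } from 'meteor/meteor';
    import { check } from 'meteor/check';
    import { Listings } from '/imports/api/listings';

    Meteor.publish('listings.byCity', function (city, limit) {
      check(city, String);
      check(limit, Number);

      return Listings.find(
        { city },                                  // narrow the result set with parameters
        {
          fields: { title: 1, price: 1, city: 1 }, // prune fields the client does not need
          limit: Math.min(limit, 100),             // cap how much any one client can pull
          sort: { price: 1 },
        }
      );
    });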

Data reactivity (a session parameter bound to a publication) should also be handled with care. If, for example, you send a request on every keypress in a search field, you can quickly overload the connection (still depending heavily on the size of the set you are publishing); one common mitigation is sketched below. We had to deal with this while building a real-estate service on Meteor: with a data set of several gigabytes, it was quite challenging to handle without clogging the pipe with excess data.
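
One way to keep a reactive search parameter from flooding the connection is to debounce it on the client before it reaches the subscription. A minimal sketch, assuming a hypothetical 'listings.search' publication and lodash for the debounce:

    import { Meteor } from 'meteor/meteor';
    import { Tracker } from 'meteor/tracker';
    import { ReactiveVar } from 'meteor/reactive-var';
    import _ from 'lodash';

    const searchTerm = new ReactiveVar('');

    // Push keystrokes into the reactive variable at most once every 300 ms.
    export const setSearchTerm = _.debounce((value) => searchTerm.set(value), 300);

    Tracker.autorun(() => {
      // Re-runs (and resubscribes) only when the debounced term changes,
      // not on every keypress.
      Meteor.subscribe('listings.search', searchTerm.get());
    });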

In terms of bandwidth, DDP is quite good because it supports fine-grained updates (sending only the changed fields instead of the whole document), and moving an item is (or will be) supported as well (server-side reordering). The messages below illustrate the field-level updates.
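
To illustrate (the document contents here are made up), this is roughly what DDP puts on the wire: an initial "added" message carrying the published fields, then "changed" messages carrying only the fields that actually changed.

    // Initial document sent to the subscriber:
    { "msg": "added", "collection": "listings", "id": "abc123",
      "fields": { "title": "Flat in Lyon", "price": 250000, "city": "Lyon" } }

    // Later, only the price changes, so DDP ships just that field:
    { "msg": "changed", "collection": "listings", "id": "abc123",
      "fields": { "price": 245000 } }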

Also take a look at this excellent answer concerning huge collections and what is done under the hood.

answered Dec 02 '22 by Flavien Volken