I am thinking of implementing a MongoDB change stream reader and I want to make sure I'm doing it correctly. There are plenty of simple examples out there of how to implement the actual reader code, including the official documentation, and I'm not too worried about that aspect of it.
I am however a little worried about the reader falling behind the change stream and not being able to catch up and I want to make sure the reader can handle the flow.
The mongo server is a cluster, and let's assume for the sake of argument that it is quite busy at all times of day. The change stream API appears to only be compatible with a single instance doing the work, given that it must iterate the stream results rather than operate on them like a queue. I am therefore worried that the single instance iterating the results could take longer to do its work than it takes for new items to be pushed into the stream.
My instinct is to have the reader simply read the stream, batch the changes, and then push them into a queue where other workers can scale horizontally to do the work. However, I still have a single instance as the reader, and it's still theoretically possible for it to fall behind the stream, even while only doing the bare minimum work of pushing changes into a queue.
So my questions are: how realistic is this worry, and is there any way to build the reader so that it can scale horizontally, even if it is only streaming the changes into a worker queue? What other considerations should I take into account?
A single reader will most likely suffice if it delegates all of the actual work to horizontally scaled queue workers.
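To make that concrete, here is a minimal sketch of a reader whose only job is to batch change events and hand them off. `pushToQueue` is a hypothetical stand-in for whatever queue producer you use (Kafka, SQS, RabbitMQ, etc.), and the database/collection names are placeholders:

```javascript
// Accumulate change events into fixed-size batches so the reader does
// as little work per event as possible before handing off.
function makeBatcher(maxSize, flush) {
  let batch = [];
  return {
    add(event) {
      batch.push(event);
      if (batch.length >= maxSize) {
        flush(batch);
        batch = [];
      }
    },
    // Flush whatever is pending (call this on a timer or at shutdown).
    drain() {
      if (batch.length > 0) {
        flush(batch);
        batch = [];
      }
    },
  };
}

// Hypothetical reader loop using the official `mongodb` driver.
async function runReader(uri, pushToQueue) {
  const { MongoClient } = require('mongodb');
  const client = await MongoClient.connect(uri);
  const collection = client.db('test').collection('inventory');
  const batcher = makeBatcher(100, pushToQueue);
  for await (const change of collection.watch()) {
    batcher.add(change);
  }
}
```

In practice you would also want a time-based flush so small batches don't sit in memory during quiet periods, but the key point is that the reader itself does nothing per event beyond an array push.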
If that turns out to be insufficient and your reader itself needs to scale horizontally, you may be able to achieve that with a $match filter that lets multiple readers divide up the work.
For example, if your documents have string ids made up of hexadecimal characters, you could split the work across two servers by giving each a $match stage that covers half of the character range:
// Change Stream Reader 1
// Note: in a change event the document's _id lives under documentKey
// (the event's top-level _id is the resume token), so match on that path.
const pipeline = [
  { $match: { 'documentKey._id': /^[0-7]/ } }
];
const collection = db.collection('inventory');
const changeStream = collection.watch(pipeline);
Then on a second machine:
// Change Stream Reader 2
const pipeline = [
  { $match: { 'documentKey._id': /^[8-9a-f]/ } }
];
const collection = db.collection('inventory');
const changeStream = collection.watch(pipeline);
If you need more than 16 servers, you can make the prefixes more specific, splitting one of the single-character ranges into two:
// Server 0 matches on /^0[0-7]/
// Server 16 matches on /^0[8-9a-f]/  (the '0' prefix split in two)
// Server 1 matches on /^1/
// ...
// Server 15 matches on /^f/
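Rather than hand-maintaining these regexes, a small helper can compute the character class for each reader. This is only a sketch for the simple n ≤ 16 case, it assumes string ids that start with a lowercase hex character, and `partitionRegex` is a name invented here for illustration:

```javascript
const HEX = '0123456789abcdef';

// Build the $match regex for reader `i` of `n` readers (n <= 16),
// dividing the 16 possible first characters as evenly as possible.
function partitionRegex(i, n) {
  const start = Math.floor((16 * i) / n);
  const end = Math.floor((16 * (i + 1)) / n); // exclusive
  return new RegExp('^[' + HEX.slice(start, end) + ']');
}

// e.g. for two readers:
//   partitionRegex(0, 2) -> /^[01234567]/
//   partitionRegex(1, 2) -> /^[89abcdef]/
```

Each reader would then use `{ $match: { 'documentKey._id': partitionRegex(i, n) } }` in its pipeline, with `i` supplied by configuration.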
This allows each machine to watch and process its own subset of the events, without duplication, while the other machines handle the rest.
Coordinating which server watches which range in a robust way becomes somewhat complex: you need to ensure that the range of a crashed or hung machine is resumed, and if you scale dynamically you need a way to deliver new ranges to the servers as the pool resizes. This approach also means events will be processed out of order across readers, so if ordering matters you will need a strategy for re-ordering messages or tolerating out-of-sequence processing.
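For the crash-and-resume part specifically, change streams hand you a resume token on every event (the event's `_id`), which you can pass back via the `resumeAfter` option on `watch`. A rough sketch, where `loadToken`, `saveToken`, and `handleChange` are hypothetical functions you would supply (persisting the token somewhere durable, such as another collection):

```javascript
// Resume a change stream from the last processed event after a restart.
async function watchWithResume(collection, pipeline, { loadToken, saveToken, handleChange }) {
  const token = await loadToken(); // null/undefined on first ever run
  const options = token ? { resumeAfter: token } : {};
  const changeStream = collection.watch(pipeline, options);
  for await (const change of changeStream) {
    await handleChange(change);
    // change._id is the resume token; persist it only after the event
    // has been handled, so a crash replays the event rather than skips it.
    await saveToken(change._id);
  }
}
```

Note that saving the token after handling gives you at-least-once rather than exactly-once delivery, so downstream consumers should be idempotent.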
But these are all different topics from this question so I will leave out the details for now.